
Posted to dev@corinthia.apache.org by jan i <ja...@apache.org> on 2015/08/01 11:33:27 UTC

Zip madness !

Hi

Does anybody know why zip has such a madly inefficient directory structure?

I am trying to understand the why, but fail.

A zip file contains one global directory with information about every single
file (a flat structure with no sub directories, although filenames may
contain a "/"). That is logical and expected.

BUT in front of every file there is a local file header, with the filename
and about 3/4 of the information from the global directory. This information
seems purely redundant and unneeded.

What am I missing here? In one of my test docx files, the local headers are
about 10% of the file size (looong filenames), which could be thrown away.
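
A minimal sketch of how that overhead could be measured, using Python's
standard zipfile and struct modules rather than Corinthia's own code. The
member names and contents are made up for illustration; 30 and 46 bytes are
the fixed sizes of a local file header and of a central directory record in
the Zip format.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"    # local file header signature
LOCAL_FIXED = 30             # fixed part of a local file header, in bytes
CENTRAL_FIXED = 46           # fixed part of a central directory record


def build_sample() -> bytes:
    """Build a small in-memory zip with long, docx-like member names (made up)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(20):
            name = f"word/some/very/long/directory/name/part{i:03d}.xml"
            zf.writestr(name, b"<a/>" * 50)
    return buf.getvalue()


def header_overhead(data: bytes):
    """Return (local header bytes, central directory bytes, total size)."""
    local = central = 0
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            # Read signature, name length and extra length from the local header.
            sig, name_len, extra_len = struct.unpack_from(
                "<4s22xHH", data, info.header_offset)
            if sig != LOCAL_SIG:
                raise ValueError("not a local file header")
            local += LOCAL_FIXED + name_len + extra_len
            central += (CENTRAL_FIXED + len(info.filename.encode("utf-8"))
                        + len(info.extra) + len(info.comment))
    return local, central, len(data)


data = build_sample()
local, central, total = header_overhead(data)
print(f"local headers: {local} bytes, "
      f"central directory: {central} bytes, file: {total} bytes")
```

With short, highly compressible members and long names like these, the two
sets of headers make up a large share of the archive, which matches the
observation above.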

Hope somebody can see what I failed to see.
rgds
jan i.

Re: Zip madness !

Posted by Dave Fisher <da...@comcast.net>.

The Zip file format came out of the DOS era in the late 1980s, with the PKZIP program.

Phil Katz is no longer with us; he died in 2000, and I heard he left uncashed royalty checks under his bed.

It is now what it is.

Regards,
Dave

RE: Zip madness !

Posted by "Dennis E. Hamilton" <de...@acm.org>.

Ah so, <http://people.apache.org/~jani/corinthia_winlibs.zip>.  

Thank you.

 - Dennis

-----Original Message-----
From: jan i [mailto:jani@apache.org] 
Sent: Saturday, August 1, 2015 14:22
To: dev@corinthia.incubator.apache.org; dennis.hamilton@acm.org
Subject: Re: Zip madness !

On Saturday, August 1, 2015, Dennis E. Hamilton <de...@acm.org>
wrote:
[ ... ]
> PS: I don't know where "corinthia_winlibs.zip" sits, so don't know how
> that works.

read building instructions in our wiki, as I announced some days ago.

rgds
jan i

[ ... ]

Re: Zip madness !

Posted by jan i <ja...@apache.org>.

On Saturday, August 1, 2015, Dennis E. Hamilton <de...@acm.org>
wrote:

> The ODF 1.2 specifications are examples of the kind of complex Zip you
> might find.  I think the Part 2 (OpenFormula) specification is particularly
> fruitful, as is Part 1.  Each of these is available as an .odt.
>
> I don't know of comparable OOXML files - those specs are generally only
> available as PDF.
>
> I am not certain what tool would demonstrate appending to one of those
> Zips without rebuilding the entire Zip or somehow obliterating a part being
> replaced.
>
>  - Dennis
>
> PS: I don't know where "corinthia_winlibs.zip" sits, so don't know how
> that works.

read building instructions in our wiki, as I announced some days ago.

rgds
jan i

-- 
Sent from My iPad, sorry for any misspellings.

RE: Zip madness !

Posted by "Dennis E. Hamilton" <de...@acm.org>.

The ODF 1.2 specifications are examples of the kind of complex Zip you might find.  I think the Part 2 (OpenFormula) specification is particularly fruitful, as is Part 1.  Each of these is available as an .odt.

I don't know of comparable OOXML files - those specs are generally only available as PDF.

I am not certain what tool would demonstrate appending to one of those Zips without rebuilding the entire Zip or somehow obliterating a part being replaced.

 - Dennis

PS: I don't know where "corinthia_winlibs.zip" sits, so don't know how that works.  

Re: Zip madness !

Posted by jan i <ja...@apache.org>.

On 1 August 2015 at 19:59, Peter Kelly <pm...@apache.org> wrote:

> > On 2 Aug 2015, at 12:45 am, jan i <ja...@apache.org> wrote:
> >
> >> After fixing this I got a correct directory listing of a test document I
> >> created in Word - I only tested it with one file however, so it may not
> >> address the problem you ran into with the particular test file you
> >> mentioned.
> >>
> > Super, do we have a bigger test document, with loads of files in it ?
>
> I don’t have one handy; we could test this on a larger scale using the zip
> command on Linux/OS X with some directories we create via a test script
> with lots of files (including creating the zip and then later appending to
> it).
>
Good idea, corinthia_winlibs.zip sits waiting here (4,000+ files, 50 MB).


rgds
jan i.


Re: Zip madness !

Posted by Peter Kelly <pm...@apache.org>.

> On 2 Aug 2015, at 12:45 am, jan i <ja...@apache.org> wrote:
> 
>> After fixing this I got a correct directory listing of a test document I
>> created in Word - I only tested it with one file however, so it may not
>> address the problem you ran into with the particular test file you
>> mentioned.
>> 
> Super, do we have a bigger test document, with loads of files in it ?

I don’t have one handy; we could test this on a larger scale using the zip command on Linux/OS X, with some directories that we create via a test script with lots of files (including creating the zip and then later appending to it).
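
A minimal sketch of such a test, using Python's standard zipfile module
instead of the zip command so that it runs the same way on any platform. The
file count, names, and contents are arbitrary assumptions.

```python
import os
import tempfile
import zipfile


def make_big_zip(path: str, count: int) -> None:
    """Create a zip with `count` small members spread over sub directories."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(count):
            zf.writestr(f"dir{i % 10}/file{i:05d}.txt", f"content {i}\n")


def append_member(path: str, name: str, payload: bytes) -> None:
    """Append one member; mode "a" adds it and rewrites the directory at the end."""
    with zipfile.ZipFile(path, "a", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big.zip")
    make_big_zip(path, 4000)
    append_member(path, "appended/late.txt", b"added afterwards\n")
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        bad = zf.testzip()      # None means no member failed its CRC check
        late = zf.read("appended/late.txt")
    print(len(names), "members, first bad member:", bad)
```

The resulting file could then be fed to the new zip reader to compare its
directory listing against `names`.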

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: Zip madness !

Posted by jan i <ja...@apache.org>.

On 1 August 2015 at 19:33, Peter Kelly <pm...@apache.org> wrote:

> Hi Jan,
>
> I’ve just fixed one bug I found (it was causing a crash, but valgrind helped
> narrow it down): a DFextZipDirEntry pointer was being set via an incorrect
> pointer entry (see my commit to the newZipExperiment branch for details).
>
Thanks, I have seen this crash too, but not consistently. I will pull your fix
before I change branch.



>
> After fixing this I got a correct directory listing of a test document I
> created in Word - I only tested it with one file however, so it may not
> address the problem you ran into with the particular test file you
> mentioned.
>
Super, do we have a bigger test document, with loads of files in it ?

rgds
jan i.


Re: Zip madness !

Posted by Peter Kelly <pm...@apache.org>.

Hi Jan,

I’ve just fixed one bug I found (it was causing a crash, but valgrind helped narrow it down): a DFextZipDirEntry pointer was being set via an incorrect pointer entry (see my commit to the newZipExperiment branch for details).

After fixing this I got a correct directory listing of a test document I created in Word - I only tested it with one file however, so it may not address the problem you ran into with the particular test file you mentioned.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

Re: Zip madness !

Posted by jan i <ja...@apache.org>.

On 1 August 2015 at 17:41, Peter Kelly <pm...@apache.org> wrote:

> Hi  Jan,
>
> I’ll get to your question in a moment, but I just checked out the
> newZipExperiment branch and noticed that almost all of the source files
> have changed (I was expecting a relatively small diff, with only a few
> files changed). It looks like most of these differences are due to
> reordering the #includes at the top of each source file. If we’re going to
> do this, could we make it a separate commit in master, so it’s easier to
> see exactly what has changed in the zip branch?
>
We are not going to do this, it was me being religious for a moment.

I need a FILE * in the zipHandle structure, and did not like void *

>
> Actually, I normally put system headers after the project's other headers
> intentionally, as it helps to detect cases where a custom header depends
> on types declared in a system header but does not include it, so that
> importing that header by itself in a source file would result in
> compilation errors due to the missing references. For example, DFBuffer.h
> has an #include <stdarg.h> at the top, since some of its functions take the
> va_list data type. If one of us uses this type in another header which
> doesn’t have #include <stdarg.h>, then any C file that imports it (directly
> or indirectly) has to remember to explicitly include stdarg.h (and that
> could be a *lot* of files, if the header is referenced from lots of
> places). So by placing any system includes needed by the source file after
> all custom headers, we can pick up on these errors more easily.
>
This is actually how we agreed to do it; you will see a newExperiment2 without
all these changes.


>
> Regarding the zip file format, I need to look a few things up and will
> get back to you shortly. But I suspect some of the duplication may be
> related to the fact that a zip file is meant to be read backwards. Rather
> than starting at the beginning of the file, reading begins at the end,
> working backwards through the file to find potentially multiple copies of
> the directory listing. This serves two purposes:
>
> 1) You can “modify” the contents of a zip file simply by appending (with
> the compressed content of new/changed files added, and a new directory
> listing including these files, and *not* including any files which have been
> “deleted”, i.e. masked out).
>
> 2) A zip file can be appended to the end of another file format; the most
> common example being self-extracting .exe files. Since .exe files are read
> from the beginning, the program loader on windows doesn’t care about the
> fact that there’s the trailing data at the end. And it’s still a valid zip
> file, since the .exe content at the start is ignored when reading the
> directory listing.
>
> I think you may be aware of some of these details already, and there are
> some nuances I’ve probably missed. I’m about to have a look through the
> code you currently have in the branch.
>
Painfully aware. I am slowly including code from an old project of mine,
which is soo old that I have forgotten why I did things.

I expect to have the open/read part finished in a few hours; otherwise it
will be delayed to Monday.

I also have an experimental write (only local, not committed).

Thanks for taking a look. I will write in here when I consider the
open/read ready for master. I would like to move that to master before I do
the write part.

rgds
jan i.

RE: Zip madness !

Posted by "Dennis E. Hamilton" <de...@acm.org>.

I've not looked at the Self-Extractor code that is prefixed so I can't comment on how that is done.  It depends on how links from the "central" directory to the individual "local" directories and their streams are done. I have forgotten the details.

I am not certain that (1) works as described.  I don't recall any mention of it in the Zip specification.  I must look for that too.

Most of the Zip tools have a test mode where the archive integrity is assessed.  It would be useful to see what they do when there are duplicate local file directories with the same name.
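
One data point, as a hedged sketch: how Python's standard zipfile module treats an archive that has two entries with the same name. Other Zip tools may well behave differently, and the member name and contents here are made up.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("content.xml", b"old version")
    with warnings.catch_warnings():
        # zipfile warns about the duplicate name but still writes the entry.
        warnings.simplefilter("ignore", UserWarning)
        zf.writestr("content.xml", b"new version")

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    names = zf.namelist()              # both entries are listed
    current = zf.read("content.xml")   # lookup by name returns the later entry
    bad = zf.testzip()                 # None means no CRC failure was found

print(names, current, bad)
```

So in this library the later entry silently wins on lookup by name, while the listing still shows both; that is worth comparing against the test modes of other tools.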

 - Dennis

-----Original Message-----
From: Peter Kelly [mailto:pmkelly@apache.org] 
Sent: Saturday, August 1, 2015 08:42
To: dev@corinthia.incubator.apache.org
Subject: Re: Zip madness !

[ ... ]
1) You can “modify” the contents of a zip file simply by appending (with the compressed content of new/changed files added, and a new directory listing including these files, and *not* including any files which have been “deleted”, i.e. masked out).

2) A zip file can be appended to the end of another file format; the most common example being self-extracting .exe files. Since .exe files are read from the beginning, the program loader on windows doesn’t care about the fact that there’s the trailing data at the end. And it’s still a valid zip file, since the .exe content at the start is ignored when reading the directory listing.

I think you may be aware of some of these details already, and there are some nuances I’ve probably missed. I’m about to have a look through the code you currently have in the branch.

[ ... ]

Re: Zip madness !

Posted by Peter Kelly <pm...@apache.org>.

Hi  Jan,

I’ll get to your question in a moment, but I just checked out the newZipExperiment branch and noticed that almost all of the source files have changed (I was expecting a relatively small diff, with only a few files changed). It looks like most of these differences are due to reordering the #includes at the top of each source file. If we’re going to do this, could we make it a separate commit in master, so it’s easier to see exactly what has changed in the zip branch?

Actually, I normally put system headers after the project's other headers intentionally, as it helps to detect cases where a custom header depends on types declared in a system header but does not include it, so that importing that header by itself in a source file would result in compilation errors due to the missing references. For example, DFBuffer.h has an #include <stdarg.h> at the top, since some of its functions take the va_list data type. If one of us uses this type in another header which doesn’t have #include <stdarg.h>, then any C file that imports it (directly or indirectly) has to remember to explicitly include stdarg.h (and that could be a *lot* of files, if the header is referenced from lots of places). So by placing any system includes needed by the source file after all custom headers, we can pick up on these errors more easily.

Regarding the zip file format, I need to look a few things up and will get back to you shortly. But I suspect some of the duplication may be related to the fact that a zip file is meant to be read backwards. Rather than starting at the beginning of the file, reading begins at the end, working backwards through the file to find potentially multiple copies of the directory listing. This serves two purposes:

1) You can “modify” the contents of a zip file simply by appending (with the compressed content of new/changed files added, and a new directory listing including these files, and *not* including any files which have been “deleted”, i.e. masked out).

2) A zip file can be appended to the end of another file format; the most common example being self-extracting .exe files. Since .exe files are read from the beginning, the program loader on windows doesn’t care about the fact that there’s the trailing data at the end. And it’s still a valid zip file, since the .exe content at the start is ignored when reading the directory listing.
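
The read-from-the-end idea can be sketched as follows. This is an illustration in Python, not the code in the newZipExperiment branch: it glues a zip behind some stand-in bytes (playing the role of the .exe content), then scans backwards for the end-of-central-directory signature. The stub bytes and the member name are made up.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory record signature
EOCD_FIXED = 22            # fixed size of that record, without the comment


def find_eocd(data: bytes) -> int:
    """Scan backwards for the end-of-central-directory record."""
    # The record is followed by at most a 65535-byte archive comment.
    lowest = max(0, len(data) - EOCD_FIXED - 0xFFFF)
    pos = data.rfind(EOCD_SIG, lowest)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    return pos


def central_directory_span(data: bytes):
    """Return (entry count, directory size, directory offset) from the record."""
    pos = find_eocd(data)
    (_sig, _disk, _cd_disk, _n_here, n_total,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4sHHHHIIH", data, pos)
    return n_total, cd_size, cd_offset


# Build a plain zip, then glue it behind some unrelated leading bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"hello")
plain = buf.getvalue()
stub = b"MZ" + b"\x00" * 510            # stand-in for a self-extractor stub
combined = stub + plain

count, cd_size, cd_offset = central_directory_span(combined)
eocd_pos = find_eocd(combined)
# The stored offset was computed for the plain zip, so it is off by the stub
# length; a reader can recover the shift from where the directory really ends.
shift = eocd_pos - cd_size - cd_offset

with zipfile.ZipFile(io.BytesIO(combined)) as zf:
    text = zf.read("hello.txt")
print(count, shift, text)
```

Python's own zipfile reader applies the same kind of shift, which is why it can still read the member from the combined bytes.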

I think you may be aware of some of these details already, and there are some nuances I’ve probably missed. I’m about to have a look through the code you currently have in the branch.

—
Dr Peter M. Kelly
pmkelly@apache.org

PGP key: http://www.kellypmk.net/pgp-key <http://www.kellypmk.net/pgp-key>
(fingerprint 5435 6718 59F0 DD1F BFA0 5E46 2523 BAA1 44AE 2966)

RE: Zip madness !

Posted by "Dennis E. Hamilton" <de...@acm.org>.

It doesn't matter.  The structure of Zip archive files is what it is and it is being used on the document formats that interest us.  We have no choice in the matter [;<).

There are profiles of Zip when it is employed as a carrier for standard document-file formats.  It is important to know (1) the Zip specification that is the basis for a standard format and (2) the profile that is used.  That applies to OOXML (in the OPC portion of the spec) and ODF (in Part 3 of ODF 1.2, section 17 of ODF 1.1). It applies for ePub also.  There is also now a common ISO profile of Zip that is intended to provide a progression of layers for use in support of document-format specifications.  
 
 - Dennis

SOME BACKGROUND

The local file headers are often produced serially as the archive is built and are there for serial processing of the Zip on structures that do not allow random access into the stream.  (OPC has a level of abstraction that allows more-efficient streaming over networks and in cloud applications but I don't know how much that is exploited outside of Microsoft products.  You may find it interesting to know that Visual Studio employs OPC in a variety of ways in carrying development artifacts.)
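
A minimal sketch of that serial style of processing, in Python for brevity: it walks the local file headers from the start of the stream and never looks at the directory at the end. The ODF-like member names are made up, and the sketch assumes each local header records its compressed size (that is, no data descriptors are in use), which holds for this sample.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"
LOCAL_FMT = "<4sHHHHHIIIHH"                  # fixed 30-byte local file header
LOCAL_SIZE = struct.calcsize(LOCAL_FMT)


def walk_local_headers(stream):
    """Yield (name, compressed size) by reading forward only."""
    while True:
        fixed = stream.read(LOCAL_SIZE)
        if len(fixed) < LOCAL_SIZE or not fixed.startswith(LOCAL_SIG):
            return                            # reached the directory or the end
        (_sig, _ver, flags, _method, _time, _date,
         _crc, csize, _usize, name_len, extra_len) = struct.unpack(LOCAL_FMT, fixed)
        if flags & 0x08:
            raise ValueError("sizes follow the data; this sketch cannot skip it")
        name = stream.read(name_len).decode("utf-8")
        stream.read(extra_len)                # skip the extra field
        stream.read(csize)                    # skip (or process) the member data
        yield name, csize


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", b"application/vnd.oasis.opendocument.text")
    zf.writestr("content.xml", b"<doc/>", compress_type=zipfile.ZIP_DEFLATED)

entries = list(walk_local_headers(io.BytesIO(buf.getvalue())))
print(entries)
```

Because only forward reads are used, the same loop would work on a pipe or a network stream, which is the situation the local headers cater for.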

The global directory at the end is a cross check and, for positionable streams, an additional support for ensuring that the Zip has not been damaged.  In some cases, the global directory has more information than the local file headers, since such details might only be known after the local file stream has been produced (checksums for example, even the length of a stream), and the global forms can employ larger pointers and sizes than can be used in the local file headers.  The global directory might also be usable in recovery of data from a damaged Zip for which an intact global directory is still present.  For programs on modern file systems, I suspect that the global directory is used almost exclusively, although the local file headers are still there, and correct.  In fact, some programs "sniff" the first local file header of ODF packages to detect the "mimetype" file entry, although it is not required that it be the first local file header.
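
To illustrate the point about details that are only known after a stream has been produced, here is a hedged Python sketch. When the standard zipfile module writes to an output that cannot seek, it sets general-purpose flag bit 3 and leaves the CRC and sizes in the local header as zero; the real values follow the data and appear in the directory at the end. The WriteOnly class and the member name are made up for the example.

```python
import io
import struct
import zipfile
import zlib


class WriteOnly(io.RawIOBase):
    """A sink that accepts writes but cannot seek or tell, like a pipe."""

    def __init__(self):
        super().__init__()
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.data += b
        return len(b)


payload = b"some document content " * 20
sink = WriteOnly()
with zipfile.ZipFile(sink, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", payload)
raw = bytes(sink.data)

# Fields of the first local file header: flags at offset 6, CRC and sizes at 14.
flags, = struct.unpack_from("<H", raw, 6)
local_crc, local_csize, local_usize = struct.unpack_from("<III", raw, 14)

with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.getinfo("content.xml")      # values come from the directory

print(hex(flags), local_crc, local_usize, info.CRC, info.file_size)
```

A reader that relied only on such a local header could not tell how long the member is, which is one concrete case where the directory at the end carries information the local header lacks.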

I find all of this intriguing, myself.  It is a challenge to provide a durable model that delivers a useful API above the physical Zip structure, one that adapts to available capabilities and removes concern for such details, allowing isolation under a better abstraction for use on behalf of a document format.
