Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2009/02/13 12:47:50 UTC

[compress] ZIP - encoding of file names - again

Let me try to capture the various threads in SANDBOX-176 and from this
list into something we can draw conclusions from.

First some background:
======================

When I implemented the ZIP classes for Ant, I was working from
InfoZIP's documentation of the format, not PKWARE's.  I've now read the
latter as well and learned a few new things.

In general the ZIP format as defined by PKWARE uses CodePage 437 for
filenames and the ZIP comment.  Initially it didn't say so explicitly,
this is just the way it was.  And of course CP437 isn't good enough
for arbitrary file names, so people simply started to use different
encodings - like java.util.zip which uses UTF-8.

Later revisions of the spec introduce a new flag that can be set to
indicate that a filename is encoded using UTF-8, the EFS flag.
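
As a minimal sketch of what honoring the flag means for a reader: the EFS
flag is bit 11 (0x0800) of an entry's general purpose bit flag.  The class
and method names below are hypothetical and are not part of compress or Ant;
the fallback charset is whatever the caller chooses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only. */
final class EfsFlag {
    /** Bit 11 of the general purpose bit flag: name and comment are UTF-8. */
    static final int EFS_BIT = 1 << 11; // 0x0800

    static boolean isSet(int generalPurposeFlags) {
        return (generalPurposeFlags & EFS_BIT) != 0;
    }

    /** Decodes raw name bytes as UTF-8 if EFS is set, otherwise with the fallback. */
    static String decodeName(int generalPurposeFlags, byte[] rawName, Charset fallback) {
        Charset cs = isSet(generalPurposeFlags) ? StandardCharsets.UTF_8 : fallback;
        return new String(rawName, cs);
    }
}
```

A reader that ignores the flag is equivalent to always taking the fallback
branch.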

According to Wolfgang's tests in SANDBOX-176 the flag is honored by
WinZIP and 7Zip while reading and I've checked that InfoZIP's unzip
5.x deals with them as well.  7Zip uses it for writing file names,
WinZIP doesn't and InfoZIP's zip 3.x may or may not write it depending
on compilation options.

InfoZIP introduces two new extra fields that can hold UTF-8 encoded
versions of the file names and the archive comments.  Extra fields in
ZIP archives hold additional data that are supposed to be ignored by
archivers that don't understand them.
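
For illustration, here is a sketch of the Unicode path extra field as I read
the published layout: header ID 0x7075, a two-byte data size, a version byte
of 1, the CRC32 of the name bytes stored in the regular header, and then the
UTF-8 name.  The class below is hypothetical and uses only the JDK; it is not
the code attached to SANDBOX-176.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical builder/parser for the Info-ZIP Unicode path extra field. */
final class UnicodePathField {
    static final int HEADER_ID = 0x7075;

    private static long crcOf(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }

    /** Builds header id + data size + version + CRC32(header name) + UTF-8 name. */
    static byte[] build(byte[] headerNameBytes, String unicodeName) {
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);
        int dataSize = 1 + 4 + utf8.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + dataSize).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) HEADER_ID);
        buf.putShort((short) dataSize);
        buf.put((byte) 1);                          // version
        buf.putInt((int) crcOf(headerNameBytes));   // ties the field to the header name
        buf.put(utf8);
        return buf.array();
    }

    /** Returns the UTF-8 name, or null if the field is malformed or its CRC is stale. */
    static String parse(byte[] field, byte[] headerNameBytes) {
        if (field.length < 4) return null;
        ByteBuffer buf = ByteBuffer.wrap(field).order(ByteOrder.LITTLE_ENDIAN);
        if ((buf.getShort() & 0xFFFF) != HEADER_ID) return null;
        int dataSize = buf.getShort() & 0xFFFF;
        if (dataSize < 5 || dataSize > buf.remaining()) return null;
        if (buf.get() != 1) return null;
        long storedCrc = buf.getInt() & 0xFFFFFFFFL;
        if (storedCrc != crcOf(headerNameBytes)) return null;
        byte[] utf8 = new byte[dataSize - 5];
        buf.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

The CRC check lets a reader discard the field when some other tool has renamed
the entry without updating the extra data.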

Wolfgang's tests indicate that WinZip reads and writes the extra
fields and 7Zip doesn't support them.  The InfoZIP tools certainly
support them.

Windows' built-in ZIP lib doesn't support either approach.

Supporting new extra fields is simple using the existing compress code
base, and supporting the EFS flag isn't that hard either - SANDBOX-176
contains the necessary code that really only needs to be tweaked to
conform to what we want to do by default.

Reading
=======

Let's keep ZipArchiveInputStream out of the discussion for now 8-)

I propose to change ZipFile to support both the EFS flag as well as
the InfoZIP extra fields when reading archives.

I'm not sure what ZipFile should do if it encounters both the EFS flag
and the extra fields.  Likely it is best to assume both hold the same
information and simply use the EFS encoded name.

The question is what ZipFile should assume as its default if neither
the EFS nor extra fields are present.  This can be controlled by
"setEncoding" right now and defaults to the platform's default
encoding but a default of UTF-8 (compatible with java.util.zip) or
CodePage 437 (compatible with formal ZIP spec) are valid choices as
well.
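
The reading rules proposed above can be sketched as a single decision: prefer
the EFS flag, then a Unicode extra field, then the configured default.  This
is only an illustration under those assumptions; NameResolver and its
parameters are hypothetical and do not describe ZipFile's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the proposed precedence when reading an entry name. */
final class NameResolver {
    static final int EFS_BIT = 0x0800;

    /**
     * @param flags            general purpose bit flag of the entry
     * @param rawName          name bytes as stored in the header
     * @param unicodeExtraName name taken from a Unicode extra field, or null if none
     * @param defaultEncoding  encoding to assume when neither hint is present
     */
    static String resolve(int flags, byte[] rawName, String unicodeExtraName,
                          Charset defaultEncoding) {
        if ((flags & EFS_BIT) != 0) {
            return new String(rawName, StandardCharsets.UTF_8); // EFS wins
        }
        if (unicodeExtraName != null) {
            return unicodeExtraName;                            // then the extra field
        }
        return new String(rawName, defaultEncoding);            // finally the default
    }
}
```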

Writing
=======

I propose new flags get/setLanguageEncodingFlag for EFS and
get/setAddUnicodeExtraFields on ZipArchiveOutputStream that control
whether either approach is used.  I.e. I propose to optionally support
either approach (and both at the same time).

IMHO the main question is what the code should do by default.

Currently I think the best default approach would be to use UTF-8 as
the default encoding and set the EFS bit since this will create
archives compatible with java.util.zip but has the additional benefit
of clearly stating it is using UTF-8.

Note that using the EFS bit may make the archive unreadable for old
archivers, that's why we need the option to turn it off.

I wouldn't add the InfoZIP extra fields by default since they increase
the archive size.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] ZIP - encoding of file names - again

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-02-19, Stefan Bodewig <bo...@apache.org> wrote:

> On 2009-02-18, Wolfgang Glas <wo...@ev-i.at> wrote:

>> A rudimentary test is in my original patch as attached to SANDBOX-176.
>> I have refactored this test to the current SVN revision and attached it to
>> this mail.

> Thanks, I've modified it quite a bit, but its guts are still intact.

Oh, I had to make the tests conditional on the encoding since cp437
isn't supported on JDK 1.4 for Windows (works for Java6).
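
A guard of this kind could look roughly like the following.  It is a sketch,
not the committed test code, and CharsetGuard is a hypothetical name; only
Charset.isSupported is a real JDK call.

```java
import java.nio.charset.Charset;

/** Hypothetical guard for tests that depend on an optional charset. */
final class CharsetGuard {
    /** True if this JVM can provide the named charset; false for unknown or illegal names. */
    static boolean available(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }
}
```

A test would then return early (or be skipped) when
CharsetGuard.available("Cp437") is false.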

Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-02-18, Wolfgang Glas <wo...@ev-i.at> wrote:

> Stefan Bodewig wrote:

>> Wolfgang, you may notice a few minor tweaks to your original code.  Do
>> you happen to have stand-alone tests for the Unicode extra fields
>> anywhere?

> A rudimentary test is in my original patch as attached to SANDBOX-176.
> I have refactored this test to the current SVN revision and attached it to
> this mail.

Thanks, I've modified it quite a bit, but its guts are still intact.
Committed.

> The test either needs to be refactored to use ZipFile or
> ZipArchiveInputStream has to be implemented ;-)

ZipFile it is.

> I also had to expose the ZipEncodingHelper functionality to the
> public in order to compile the new test.

Just because you didn't add the test to the zip package 8-)

> You also need the two zip files attached to SANDBOX-176 in
> src/test/resources in order to run the interoperability test.

They'll be added once EFS support is in.

I've added you as contributor to the POM, let me know if you want me
to remove the email address or use a different one.

Thanks!

        Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig wrote:
> I started to take some baby steps implementing it, in particular
> 
> On 2009-02-13, Stefan Bodewig <bo...@apache.org> wrote:
> 
>> Currently I think the best default approach would be to use UTF-8 as
>> the default encoding and set the EFS bit since this will create
>> archives compatible with java.util.zip but has the additional benefit
>> of clearly stating it is using UTF-8.
> 
> UTF-8 is now the default for ZipArchiveOutputStream and ZipFile, EFS
> support is not yet in.
> 
> The InfoZIP extra fields are supported, but one has to write them
> manually right now.  They should be read transparently by ZipFile but
> don't affect the file name or comment ATM.
> 
> Wolfgang, you may notice a few minor tweaks to your original code.  Do
> you happen to have stand-alone tests for the Unicode extra fields
> anywhere?

A rudimentary test is in my original patch as attached to SANDBOX-176.
I have refactored this test to the current SVN revision and attached it to this
mail. The test either needs to be refactored to use ZipFile or
ZipArchiveInputStream has to be implemented ;-)

  I also had to expose the ZipEncodingHelper functionality to the public in
order to compile the new test.

  You also need the two zip files attached to SANDBOX-176 in src/test/resources
in order to run the interoperability test.

> I took the liberty to apply the same patches to Ant trunk as well.

Nice to see ;-)

  Best regards,

     Wolfgang

Re: [compress] ZIP - encoding of file names - again

Posted by Torsten Curdt <tc...@apache.org>.
Cool!

On Fri, Feb 20, 2009 at 10:03, Stefan Bodewig <bo...@apache.org> wrote:
> On 2009-02-19, Stefan Bodewig <bo...@apache.org> wrote:
>
>> On 2009-02-18, Stefan Bodewig <bo...@apache.org> wrote:
>
>>> UTF-8 is now the default for ZipArchiveOutputStream and ZipFile, EFS
>>> support is not yet in.
>
>> Now it is.
>
> Just a quick update.  Since I merged the EFS support into Ant's code
> base as well, all JARs created by Ant now have the EFS-flag set - and
> given that there have been no new failures in last night's Gump runs
> we can at least conclude that the bit doesn't affect JAR-file
> interoperability with java.util.zip at all.
>
> Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-02-19, Stefan Bodewig <bo...@apache.org> wrote:

> On 2009-02-18, Stefan Bodewig <bo...@apache.org> wrote:

>> UTF-8 is now the default for ZipArchiveOutputStream and ZipFile, EFS
>> support is not yet in.

> Now it is.

Just a quick update.  Since I merged the EFS support into Ant's code
base as well, all JARs created by Ant now have the EFS-flag set - and
given that there have been no new failures in last night's Gump runs
we can at least conclude that the bit doesn't affect JAR-file
interoperability with java.util.zip at all.

Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-02-18, Stefan Bodewig <bo...@apache.org> wrote:

> UTF-8 is now the default for ZipArchiveOutputStream and ZipFile, EFS
> support is not yet in.

Now it is.  I combined Wolfgang's patch for EFS with one by TAMURA
Kent to Ant https://issues.apache.org/bugzilla/show_bug.cgi?id=45548
and shuffled a bit of code around in addition.

Tests are in the works, manual tests look OK and I guess the more
extensive tests in Ant will tell me what I've broken.

So now ZipArchiveOutputStream writes UTF8 by default (like
java.util.zip) but also sets the EFS flag.  The EFS flag can be
disabled.

ZipFile detects the EFS flag and automatically parses file names as
UTF8 if it is set.

The only remaining missing piece now is ZipFile recognizing the
InfoZIP extra fields if no EFS flag is there and
ZipArchiveOutputStream optionally creating them.

And documentation, of course.

Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Stefan Bodewig <bo...@apache.org>.
I started to take some baby steps implementing it, in particular

On 2009-02-13, Stefan Bodewig <bo...@apache.org> wrote:

> Currently I think the best default approach would be to use UTF-8 as
> the default encoding and set the EFS bit since this will create
> archives compatible with java.util.zip but has the additional benefit
> of clearly stating it is using UTF-8.

UTF-8 is now the default for ZipArchiveOutputStream and ZipFile, EFS
support is not yet in.

The InfoZIP extra fields are supported, but one has to write them
manually right now.  They should be read transparently by ZipFile but
don't affect the file name or comment ATM.

Wolfgang, you may notice a few minor tweaks to your original code.  Do
you happen to have stand-alone tests for the Unicode extra fields
anywhere?

I took the liberty to apply the same patches to Ant trunk as well.

Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-02-13, Wolfgang Glas <wo...@ev-i.at> wrote:

> Stefan Bodewig wrote:

>> Reading
>> =======

>> The question is what ZipFile should assume as its default if neither
>> the EFS nor extra fields are present.  This can be controlled by
>> "setEncoding" right now and defaults to the platform's default
>> encoding but a default of UTF-8 (compatible with java.util.zip) or
>> CodePage 437 (compatible with formal ZIP spec) are valid choices as
>> well.

> AFAICS, Ant API users are used to the 'setEncoding(String encoding)'
> approach, although it would be better to rename the method to
> 'setDefaultEncoding(String encoding)'.

Ant is a different concern since Ant has to keep its API backwards
compatible.  commons-compress is free to break it, and I'm prepared to
do that (and live with more merge conflicts when shipping changes
between compress and Ant).

I agree that setDefaultEncoding would be a better name.

>> Writing
>> =======

>> I propose new flags get/setLanguageEncodingFlag for EFS and
>> get/setAddUnicodeExtraFields on ZipArchiveOutputStream that control
>> whether either approach is used.  I.e. I propose to optionally support
>> either approach (and both at the same time).

> The question at this point is whether to use the EFS flag for *all* records
> or only for records not encodable by the encoding set by
> 'setEncoding(String)'.

> IMHO we should adopt the 7-zip approach and set the EFS flag only for
> non-encodable records, since this approach is minimally invasive.

Works for me.

>> IMHO the main question is what the code should do by default.

>> Currently I think the best default approach would be to use UTF-8 as
>> the default encoding and set the EFS bit since this will create
>> archives compatible with java.util.zip but has the additional benefit
>> of clearly stating it is using UTF-8.

> Yes, this seems to be reasonable, because users will expect JAVA-compatibility
> in the first instance.

>> Note that using the EFS bit may make the archive unreadable for old
>> archivers, that's why we need the option to turn it off.

> I've not seen an old archiver that refused to unpack such a file.

There is a warning and clearly the general purpose bit will be set to
a value some archivers don't understand.  If they ignore the fact,
that is fine.

> How about my suggestion for a 'tuning' method that sets up the
> ZipOutputStream in a way that's suitable for most unzip tools out
> in the wild?

I'm not sure whether we want to encode such magic.

> Or should we gather all the knowledge we collected in SANDBOX-176 and in
> this thread into the JavaDoc of the class?

Yes, we should.

Stefan



Re: [compress] ZIP - encoding of file names - again

Posted by Wolfgang Glas <wo...@ev-i.at>.
Hi Stefan,

  My comments follow.

Stefan Bodewig wrote:
> Let me try to capture the various threads in SANDBOX-176 and from this
> list into something we can draw conclusions from.
> 
> First some background:
> ======================

[snip]

> Reading
> =======
> 
> Let's keep ZipArchiveInputStream out of the discussion for now 8-)

Yes, we should do so. I analysed my WinZip example and found that the Unicode
extra fields are written to the central directory records and not to the local
file headers. This makes it impossible to get the real Unicode filename when
parsing a ZIP file the way all ZipInputStream implementations I've seen do.
(They sequentially parse the local file headers and ignore the central
directory records...)
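
To make the limitation concrete, here is a rough sketch of what a sequential
reader has available when it reaches an entry: only the local file header
(signature 0x04034b50, name length at offset 26, name starting at offset 30).
LocalHeaderPeek is a hypothetical class; the sample archive is produced with
java.util.zip purely for demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: read the name from the first local file header only. */
final class LocalHeaderPeek {
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;

    /** Returns the name stored in the local file header at offset 0. */
    static String firstLocalName(byte[] zip, Charset cs) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (zip.length < 30 || buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("no local file header at offset 0");
        }
        int nameLength = buf.getShort(26) & 0xFFFF;
        if (30 + nameLength > zip.length) {
            throw new IllegalArgumentException("truncated local file header");
        }
        // Anything stored only in the central directory (at the end of the
        // archive) has not been seen yet at this point.
        return new String(zip, 30, nameLength, cs);
    }

    /** Builds a one-entry archive in memory with java.util.zip. */
    static byte[] sampleZip(String entryName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(new byte[] {1, 2, 3});
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```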

Furthermore, relicensing any GPL version of java.util.zip.ZipInputStream
seems to be impossible because of the large number of contributors to
the code out there. (I've tried to find the contributors to GNU Classpath's
version; there's nearly no possibility of finding them all...)

> I propose to change ZipFile to support both the EFS flag as well as
> the InfoZIP extra fields when reading archives.

That's a good choice. I've already provided the parsing code for Unicode extra
fields, so the implementation should be quite easy ;-)

> I'm not sure what ZipFile should do if it encounters both the EFS flag
> and the extra fields.  Likely it is best to assume both hold the same
> information and simply use the EFS encoded name.

Agreed.

> The question is what ZipFile should assume as its default if neither
> the EFS nor extra fields are present.  This can be controlled by
> "setEncoding" right now and defaults to the platform's default
> encoding but a default of UTF-8 (compatible with java.util.zip) or
> CodePage 437 (compatible with formal ZIP spec) are valid choices as
> well.

AFAICS, Ant API users are used to the 'setEncoding(String encoding)' approach,
although it would be better to rename the method to 'setDefaultEncoding(String
encoding)'.

> Writing
> =======
> 
> I propose new flags get/setLanguageEncodingFlag for EFS and
> get/setAddUnicodeExtraFields on ZipArchiveOutputStream that control
> whether either approach is used.  I.e. I propose to optionally support
> either approach (and both at the same time).

The question at this point is whether to use the EFS flag for *all* records or
only for records not encodable by the encoding set by 'setEncoding(String)'.

IMHO we should adopt the 7-zip approach and set the EFS flag only for
non-encodable records, since this approach is minimally invasive.

Surely the EFS flag should be set for all records, if the encoding is set to utf-8.
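
The per-record rule described above could be sketched like this: keep the
configured encoding whenever it can represent the name, and fall back to
UTF-8 plus the EFS flag otherwise (and always when the configured encoding is
UTF-8).  PerEntryEncoding and Encoded are hypothetical names; this is not
7-zip's or compress's actual code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of "EFS only for names the configured encoding cannot hold". */
final class PerEntryEncoding {
    static final int EFS_BIT = 0x0800;

    /** Name bytes to store plus the flag bits to OR into the entry's general purpose flag. */
    static final class Encoded {
        final byte[] nameBytes;
        final int flags;

        Encoded(byte[] nameBytes, int flags) {
            this.nameBytes = nameBytes;
            this.flags = flags;
        }
    }

    static Encoded encode(String name, Charset configured) {
        boolean utf8 = configured.equals(StandardCharsets.UTF_8);
        if (utf8 || !configured.newEncoder().canEncode(name)) {
            // Either UTF-8 was requested or the name does not fit: use UTF-8 and say so.
            return new Encoded(name.getBytes(StandardCharsets.UTF_8), EFS_BIT);
        }
        // The name is representable, so leave the record untouched by EFS.
        return new Encoded(name.getBytes(configured), 0);
    }
}
```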

> IMHO the main question is what the code should do by default.
> 
> Currently I think the best default approach would be to use UTF-8 as
> the default encoding and set the EFS bit since this will create
> archives compatible with java.util.zip but has the additional benefit
> of clearly stating it is using UTF-8.

Yes, this seems to be reasonable, because users will expect JAVA-compatibility
in the first instance.

> Note that using the EFS bit may make the archive unreadable for old
> archivers, that's why we need the option to turn it off.

I've not seen an old archiver that refused to unpack such a file. The only
problem is that the file names of the unpacked files are wrong (UTF-8
interpreted as CP437; the good news is that all code points from 0x80-0xff in
CP437 are allocated). However, that's the same problem that arises when
unpacking a file created by java.util.zip.ZipOutputStream.
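
A small sketch of that failure mode, assuming the reader simply decodes the
UTF-8 bytes with a single-byte legacy charset.  Mojibake is a hypothetical
class; it uses CP437 ("IBM437") where the JVM provides it and falls back to
ISO-8859-1 as a stand-in, since CP437 is not guaranteed to be available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demonstration of UTF-8 names read with a legacy single-byte charset. */
final class Mojibake {
    /** CP437 if this JVM ships it, otherwise ISO-8859-1 as a stand-in. */
    static Charset legacyCharset() {
        return Charset.isSupported("IBM437")
                ? Charset.forName("IBM437")
                : StandardCharsets.ISO_8859_1;
    }

    /** What an old archiver would show for a name that was stored as UTF-8. */
    static String misread(String name) {
        return new String(name.getBytes(StandardCharsets.UTF_8), legacyCharset());
    }

    public static void main(String[] args) {
        // One non-ASCII character becomes two legacy characters; ASCII survives.
        System.out.println(misread("\u00e4.txt"));
    }
}
```

Every byte still maps to some character, so extraction succeeds; the name is
just not the one the creator intended.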

> I wouldn't add the InfoZIP extra fields by default since they increase
> the archve size.

Yes, that's fine.

How about my suggestion for a 'tuning' method that sets up the ZipOutputStream
in a way that's suitable for most unzip tools out in the wild?

Or should we gather all the knowledge we collected in SANDBOX-176 and in this
thread into the JavaDoc of the class?

  Regards,

    Wolfgang

