Posted to dev@ant.apache.org by Stefan Bodewig <bo...@apache.org> on 2009/02/26 15:16:26 UTC

encoding in ZIP package

Hi all,

over the past two weeks commons-compress has been adding stuff for
more advanced ZIP features and I've merged the changes over to our zip
package.  The changes bring two new options with them and I'd like to
get some feedback as to which defaults our tasks should use wrt these
options.

First some background:

Traditionally file names inside ZIP archives are encoded using code
page 437 (the original IBM PC/DOS code page).  This cannot represent
many characters and thus people have chosen multiple incompatible ways
to use different encodings.  jar uses UTF-8.  Ant's tasks provide
options to set the encoding when reading/writing archives and default
to the platform's default encoding for zip/unzip or UTF-8 for
jar/unjar.
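
To illustrate why a mismatch between writer and reader encodings
matters, here is a minimal sketch that decodes the same raw name bytes
with two different charsets.  The class and method names are made up
for this example, and ISO-8859-1 merely stands in for "some non-UTF-8
encoding" (code page 437 would show the same kind of damage but is not
guaranteed to be available on every JVM).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Ant or commons-compress.
final class NameEncodingDemo {

    /** Decodes the raw file name bytes of an entry with the given charset. */
    static String decode(byte[] rawName, Charset charset) {
        return new String(rawName, charset);
    }

    public static void main(String[] args) {
        // "ä.txt" as a UTF-8 writer (for example jar) would store it.
        byte[] raw = "\u00e4.txt".getBytes(StandardCharsets.UTF_8);

        // A reader that assumes UTF-8 recovers the original name.
        System.out.println(decode(raw, StandardCharsets.UTF_8));

        // A reader that assumes a single-byte encoding sees two
        // characters (U+00C3, U+00A4) where the writer meant one.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```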

Now the new stuff.

Language Encoding Flag
----------------------

PKWARE as the definer of the ZIP standard have designated a bit inside
the "general purpose bits" part of the entry's metadata to say "my
file name is in UTF-8".  This flag is recognized by more modern PKWARE
archivers, 7ZIP and very recent InfoZIP tools (if compiled using the
correct options).  7ZIP creates archives using that flag.

WinZIP and Windows' "compressed folders" completely ignore the flag.

The ZipOutputStream code right now sets the flag if encoding is UTF-8
(i.e. we are writing JARs) which makes those who understand it
immediately pick up the correct file names.  Those who don't know the
flag are no better off than before - java.util.zip seems to be happy
with and without the flag.

The ZipFile code right now recognizes the flag and, if it is set,
ignores any explicitly specified encoding and uses UTF-8 instead,
assuming the archiver knew what it was doing.

I think both are fine defaults and I'm not even sure we need to make
them user configurable on the reading side.  We may add an option on
the writing side if there is some rare archiver that chokes on an
unknown bit in the general purpose bit area.
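
For readers who want to see the logic spelled out, the behaviour
described above can be sketched roughly like this.  This is not the
actual ZipOutputStream/ZipFile code; the class and method names are
invented for illustration.  The only hard fact used is that the
language encoding flag is bit 11 of the general purpose bit field.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical.
final class LanguageEncodingFlag {

    /** Bit 11 of the general purpose bits: name and comment are UTF-8. */
    static final int UTF8_FLAG = 1 << 11;

    static boolean isSet(int generalPurposeBits) {
        return (generalPurposeBits & UTF8_FLAG) != 0;
    }

    /** Writing side: set the flag only if the archive encoding is UTF-8. */
    static int forWriting(int generalPurposeBits, Charset archiveEncoding) {
        return StandardCharsets.UTF_8.equals(archiveEncoding)
            ? generalPurposeBits | UTF8_FLAG
            : generalPurposeBits;
    }

    /** Reading side: a set flag overrides the explicitly requested encoding. */
    static Charset forReading(int generalPurposeBits, Charset requested) {
        return isSet(generalPurposeBits) ? StandardCharsets.UTF_8 : requested;
    }
}
```

An archiver that does not know the flag simply never tests the bit,
which is why writing it should be harmless in most cases.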

InfoZip Unicode Extra Fields
----------------------------

The InfoZIP folks have defined new ZIP extra fields that store UTF-8
versions of file names and comments in the entry's metadata - no
matter what the encoding of the normal name and comment fields may be.

PKWARE and WinZIP recognize these extra fields, 7ZIP and Windows'
"compressed folders" ignore them.  WinZIP creates archives using them
(but we won't benefit from that unless we fix 
<https://issues.apache.org/bugzilla/show_bug.cgi?id=46637>).

For maximum interop it may be a good idea to write the extra fields,
but it will make the archives bigger.  That's why the current
ZipOutputStream doesn't write them by default - but it can be told to
do so.

ZipFile currently ignores the extra fields by default but can be told
to look for them.  It will ignore them if the language encoding flag
has been set.  It may be a good idea to look for the extra fields by
default since it really doesn't cost too much.
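
As a rough sketch of what "looking for the extra fields" involves: the
Unicode path field (header id 0x7075) carries a version byte, the
CRC-32 of the entry's normal name bytes and then the UTF-8 name.  The
code below builds and parses only the data part of such a field; it is
not the ZipFile implementation, and the class and method names are
assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sketch only; names are hypothetical.
final class UnicodePathField {

    /** Header id of the InfoZIP Unicode path extra field. */
    static final int HEADER_ID = 0x7075;

    /** Builds the field data: version, CRC-32 of the raw name, UTF-8 name. */
    static byte[] build(byte[] rawName, String unicodeName) {
        byte[] utf8 = unicodeName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(5 + utf8.length)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 1);               // version
        buf.putInt((int) crc(rawName));  // ties the field to the normal name
        buf.put(utf8);
        return buf.array();
    }

    /**
     * Returns the UTF-8 name, or null if the field is too short, has an
     * unknown version, or its CRC does not match the entry's raw name
     * (for example because another tool renamed the entry).
     */
    static String parse(byte[] fieldData, byte[] rawName) {
        if (fieldData.length < 5) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(fieldData)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        if (buf.get() != 1) {
            return null;
        }
        long storedCrc = buf.getInt() & 0xFFFFFFFFL;
        if (storedCrc != crc(rawName)) {
            return null;
        }
        return new String(fieldData, 5, fieldData.length - 5,
                          StandardCharsets.UTF_8);
    }

    private static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }
}
```

The size cost mentioned above is visible here: every entry pays five
bytes plus a second copy of its name (plus the usual four bytes of
extra field header).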

Defaults?
---------

I want to add new flags to <zip> and <unzip> (and thus the
subclasses).

<zip>:

* setLanguageEncodingFlag - doesn't do anything if the encoding is not
  UTF-8.  Controls whether ZipOutputStream sets the flag.

  I'd make that default to true.

* createUnicodeExtraFields

  Controls whether ZipOutputStream writes Unicode extra fields.

  I'd make that default to false.

<unzip>:

* parseUnicodeExtraFields

  Controls whether ZipFile searches for Unicode extra fields.

  I'm uncertain as to what the default should be.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
For additional commands, e-mail: dev-help@ant.apache.org


Re: encoding in ZIP package

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-02-26, Dominique Devienne <dd...@gmail.com> wrote:

> There's only so much you can do with a loosely defined "format" like
> zip, and IMHO you've done plenty already.

compress is developing a community right now and I had quite a bit of
help, fortunately.

Stefan



Re: encoding in ZIP package

Posted by Dominique Devienne <dd...@gmail.com>.
On Thu, Feb 26, 2009 at 8:16 AM, Stefan Bodewig <bo...@apache.org> wrote:
> over the past two weeks commons-compress has been adding stuff for
> more advanced ZIP features and I've merged the changes over to our zip
> package.  The changes bring two new options with them and I'd like to
> get some feedback as to which defaults our tasks should use wrt these
> options.
> [...]
> <zip>:
> * setLanguageEncodingFlag - doesn't do anything if the encoding is not
>  UTF-8.  Controls whether ZipOutputStream sets the flag.
>
>  I'd make that default to true.

Sounds reasonable.

> * createUnicodeExtraFields
>  Controls whether ZipOutputStream writes Unicode extra fields.
>
>  I'd make that default to false.
>
> <unzip>:

Same.

> * parseUnicodeExtraFields
>  Controls whether ZipFile searches for Unicode extra fields.
>
>  I'm uncertain as to what the default should be.

Don't know.

Not very helpful I'm afraid. I've read your message with interest, and
you obviously have thought through a lot of the involved issues, and
what you write makes sense. Regarding parseUnicodeExtraFields, I'd keep
the existing behavior for now; the fact that the attribute exists to
resolve corner cases is enough in my mind.

There's only so much you can do with a loosely defined "format" like
zip, and IMHO you've done plenty already. --DD
