You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@commons.apache.org by Wolfgang Glas <wo...@ev-i.at> on 2009/03/01 22:27:33 UTC

[compress] [PATCH] Refactoring of zip encoding support.

Hello all,

  Well, the latest discussions with Stefan showed two shortcoming of our current
ZIP unicode support:

1) Unicode extra fields are written for all ZIP entries and not only for
entries, which are not encodable by the encoding set to ZipArchiveOutputStream.

2) In order to implement selective writing of unicode specials, one needs a
robust implementation of wether a name can be encoded or not. This is
exspecially inhibtied by the fact, taht Cp437 has been omitted from java-1.4's
java.nio.Charset.

To overcome these shortcoming, I had to introduce a ZipEncoding interface plus a
java.nio implementation and a handcrafted implementation for Cp437 (and cp850)
and refactor all the encoding stuff.

The patch is attached. The new code is IMHO really better to read and make all
cp437-related stuff accessible on java-1.4 as well. The benfit from all these
efforts in the end is, that the cp437 test case now runs flawlessly under java-1.4

  Stefan, might you please review the patch and eventually apply this one?

   TIA,

    Wolfgang

Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-02, Wolfgang Glas <wo...@ev-i.at> wrote:

> ...another small patch with even more javadoc typos and superfluent imports
> fixed is attached.

svn rev 749524

Thanks

        Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-02, Stefan Bodewig <bo...@apache.org> wrote:
> 
>> some cosmetics and commented out the "only create Unicode field for
>> non-encodable paths" part - svn revision 749342.
> 
> and 749344 - you misspet encoding in Simple8BitEncoding.java and I
> didn't see it in time.

...another small patch with even more javadoc typos and superfluent imports
fixed is attached. TIA for committing,

  Wolfgang

Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-02, Stefan Bodewig <bo...@apache.org> wrote:
> 
>> some cosmetics and commented out the "only create Unicode field for
>> non-encodable paths" part - svn revision 749342.
> 
> and 749344 - you misspet encoding in Simple8BitEncoding.java and I
> didn't see it in time.

Yes, ThX, that's why we have code reviewers ;-)

Hopefully we will get the flaggery in ZipArchiveOutputStream staight during the
next days ;-)

Implementation of all kinds of policies should be quite easy, now my ZipEncoding
engine is committed.

  Regards, Wolfgang


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-02, Stefan Bodewig <bo...@apache.org> wrote:

> some cosmetics and commented out the "only create Unicode field for
> non-encodable paths" part - svn revision 749342.

and 749344 - you misspet encoding in Simple8BitEncoding.java and I
didn't see it in time.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-04, Wolfgang Glas <wo...@ev-i.at> wrote:

> Hello Stefan reviewed you code and found out, that you did not
> strictly use the same encoding for filenames and comments in one
> entry.

Good catch, thanks!

svn rev 750310

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-04, Stefan Bodewig <bo...@apache.org> wrote:
> 
>> On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>>> The implementation should be be straightforward, shall I prepare a
>>> patch or can you afford doing it at your own?
> 
>> Will do it myself.
> 
> svn revisions 749906 and 749907

Hello Stefan reviewed you code and found out, that you did not strictly use the
same encoding for filenames and comments in one entry.

A patch, which corrects this behaviour is attached.

  Regards,

    Wolfgang

Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-04, Stefan Bodewig <bo...@apache.org> wrote:
> 
>> On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>>> The implementation should be be straightforward, shall I prepare a
>>> patch or can you afford doing it at your own?
> 
>> Will do it myself.
> 
> svn revisions 749906 and 749907

ThX very much, Stefan ;-)

I will have a look at the new version and give you feedback in the evening.

  Wolfgang


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-04, Stefan Bodewig <bo...@apache.org> wrote:

> On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:

>> The implementation should be be straightforward, shall I prepare a
>> patch or can you afford doing it at your own?

> Will do it myself.

svn revisions 749906 and 749907

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:

> Stefan Bodewig schrieb:
>> On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:

>>> Stefan Bodewig schrieb:
>>>> On 2009-03-02, Wolfgang Glas <wo...@ev-i.at> wrote:

>>> Acccording to my tests WinZip recognizes the EFS flag upon
>>> reading.

>> Then my documenation is wrong 8-)

> Sorry for not exactly reading the Documentation,

No problem, will fix it.

>>> Secondly, if you set the encoding to UTF-8, there's no need for
>>> unicode extra fields anyway.

>> Except when your client doesnt recognize the EFS flag and thinks you'd
>> be using CP437 - but happily accepts the Unicode extra fields.  I
>> thought this would be the case for WinZIP.

> Yes, the EFS flag is of little usefulness. It has been added very
> late to Specs and most implementation ignore it right away.

IIUC it is detected by PKWARE, 7Zip and WinZIP, commons-compress - and
InfoZIP's unzip as well as Ant if the version is recent enough.  Not
too bad.

>> You vastly overestimate the effort it takes to write an Ant task.

> That nice, however, I hope that I can avoid adding ant to the list
> of OS-project I participate in ;-)

You already contributed to it, thanks.

>> createUnicodeExtraFields = NEVER (default) | ALWAYS | NOT_ENCODABLE
>> useLanguageEncodingFlag = true (default) | false
>> fallbackToUtf8 = true | false

> Agreed ;-)

> The implementation should be be straightforward, shall I prepare a
> patch or can you afford doing it at your own?

Will do it myself.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>> Stefan Bodewig schrieb:
>>> On 2009-03-02, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>>>> Stefan Bodewig schrieb:
>>>>> On 2009-03-01, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>>>>>> 1) Unicode extra fields are written for all ZIP entries and not only
>>>>>> for entries, which are not encodable by the encoding set to
>>>>>> ZipArchiveOutputStream.
> 
>>>>> Maybe room for yet another flag?  Or an enum-like option
> 
>>>>> setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)
> 
>>> Consider the WinZIP case, WinZIP wouldn't recognize the EFS.  If you
>>> set the encoding to UTF-8 and use your code and only add extra fields
>>> for non-encodable paths, WinZIP will never see the correct path.
> 
>> Acccording to my tests WinZip recognizes the EFS flag upon
>> reading.
> 
> Then my documenation is wrong 8-)

Sorry for not exactly reading the Documentation, but I got stuck because the EFS
flag seemed to be not enough for me and I wanted to get this straight before.
But I think we've come a long way and the end i near ;-)

>> Secondly, if you set the encoding to UTF-8, there's no need for
>> unicode extra fields anyway.
> 
> Except when your client doesnt recognize the EFS flag and thinks you'd
> be using CP437 - but happily accepts the Unicode extra fields.  I
> thought this would be the case for WinZIP.

Yes, the EFS flag is of little usefulness. It has been added very late to Specs
and most implementation ignore it right away. Hence thes introduced extra fields
and now we have to live with both 8-)

>>> but looking at the names we may be better off with two independent
>>> options.  Hmm, yes, right now I prefer two flags because they seem to
>>> be orthogonal.
> 
>> I think you should choose, which approach better fits your needs in
>> ant ;-) At least you have to write an XML parser for these settings
> 
> You vastly overestimate the effort it takes to write an Ant task.
> 
> http://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/taskdefs/Zip.java?r1=738330&r2=748593
> 
> is all I had to do for the two existing options.

That nice, however, I hope that I can avoid adding ant to the list of OS-project
I participate in ;-)

>> and the documentation, so you might choose the approach which may be
>> explained in brief words.
> 
>> I can live very well with two options ;-)
> 
> If you throw in "fallbacks" we are actually facing three concepts.
> 
> OK, this is what I feel makes most sense:
> 
> createUnicodeExtraFields = NEVER (default) | ALWAYS | NOT_ENCODABLE
> useLanguageEncodingFlag = true (default) | false
> fallbackToUtf8 = true | false

Agreed ;-)

> I'm not sure about the default for the later, probably
> default fallbackToUtf8 = (createUnicodeExtraFields == NEVER)

The default for the later should be false, it is a special option for people who
now, what they are doing.

The implementation should be be straightforward, shall I prepare a patch or can
you afford doing it at your own?

  Regards,

    Wolfgang


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-03, Wolfgang Glas <wo...@ev-i.at> wrote:

> Stefan Bodewig schrieb:
>> On 2009-03-02, Wolfgang Glas <wo...@ev-i.at> wrote:

>>> Stefan Bodewig schrieb:
>>>> On 2009-03-01, Wolfgang Glas <wo...@ev-i.at> wrote:

>>>>> 1) Unicode extra fields are written for all ZIP entries and not only
>>>>> for entries, which are not encodable by the encoding set to
>>>>> ZipArchiveOutputStream.

>>>> Maybe room for yet another flag?  Or an enum-like option

>>>> setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)

>> Consider the WinZIP case, WinZIP wouldn't recognize the EFS.  If you
>> set the encoding to UTF-8 and use your code and only add extra fields
>> for non-encodable paths, WinZIP will never see the correct path.

> Acccording to my tests WinZip recognizes the EFS flag upon
> reading.

Then my documenation is wrong 8-)

> Secondly, if you set the encoding to UTF-8, there's no need for
> unicode extra fields anyway.

Except when your client doesnt recognize the EFS flag and thinks you'd
be using CP437 - but happily accepts the Unicode extra fields.  I
thought this would be the case for WinZIP.

>> but looking at the names we may be better off with two independent
>> options.  Hmm, yes, right now I prefer two flags because they seem to
>> be orthogonal.

> I think you should choose, which approach better fits your needs in
> ant ;-) At least you have to write an XML parser for these settings

You vastly overestimate the effort it takes to write an Ant task.

http://svn.apache.org/viewvc/ant/core/trunk/src/main/org/apache/tools/ant/taskdefs/Zip.java?r1=738330&r2=748593

is all I had to do for the two existing options.

> and the documentation, so you might choose the approach which may be
> explained in brief words.

> I can live very well with two options ;-)

If you throw in "fallbacks" we are actually facing three concepts.

OK, this is what I feel makes most sense:

createUnicodeExtraFields = NEVER (default) | ALWAYS | NOT_ENCODABLE
useLanguageEncodingFlag = true (default) | false
fallbackToUtf8 = true | false

I'm not sure about the default for the later, probably
default fallbackToUtf8 = (createUnicodeExtraFields == NEVER)

Unfortunately I don't really see how we can merge all permutations
into meaningful names otherwise.  But suggestions are welcome.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-02, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>> Stefan Bodewig schrieb:
>>> On 2009-03-01, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>>>> 1) Unicode extra fields are written for all ZIP entries and not only
>>>> for entries, which are not encodable by the encoding set to
>>>> ZipArchiveOutputStream.
> 
>>> Maybe room for yet another flag?  Or an enum-like option
> 
>>> setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)
> 
> Consider the WinZIP case, WinZIP wouldn't recognize the EFS.  If you
> set the encoding to UTF-8 and use your code and only add extra fields
> for non-encodable paths, WinZIP will never see the correct path.

Acccording to my tests WinZip recognizes the EFS flag upon reading. Upon writing
WinZip uses extra fields and encodes filenames as Cp437, which is really the
most useful variant these days.

Secondly, if you set the encoding to UTF-8, there's no need for unicode extra
fields anyway. But as mentioned above, the most portable tool-readable variant
as requested by the reporter of the original SANDBOX-176 issue is writing Cp437
and adding unicode extra fields. EFS support in the wild is not really
widespread, propably due to a mid-air collision between specification writing
and omplementation of widespread ZIP-Implementations....

>> I like the idea of a unicode policy flag ;-)
> 
> May be a better approach, agreed.  But only if we manage to cover all
> border cases.
> 
>> My suggestion is
> 
>> setUnicodePolicy(
>>   SURROGATES   | /* no extra fields, no utf-8 fallback, only %Uxxxx surrogates*/
>>   EXTRA_FIELDS | /* extra fields for unencodable entriey, no utf-8 fallback   */
>>   EXTRA_FIELDS_ALWAYS | /* extra fields for all entries, no utf-8 fallback    */
>>   UTF8_FALLBACK| /* fall back to utf-8 plus EFS flag for unencodable entries. */
>>   UTF8_FALLBACK_EXTRA_FIELDS| /* fall back to utf-8 plus EFS flag plus extra
>>                                  fields for unencodable */
>>   UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS /* fall back to utf-8 plus EFS flag for
>>                                        unencodable entries, exta fields for all
>>                                        entries. */
>> )
> 
>> We might drop the last two options and we might choose a better
>> wording, however the direction should IMHO be as above mentioned...
> 
> This covers all permutations, agreed.
> 
> Names, names, I'm really bad at them.
> 
> EXTRA_FIELDS                      => ADD_EXTRA_FIELDS_FOR_UNENCODABLE
> EXTRA_FIELDS_ALWAYS               => ADD_EXTRA_FIELDS
> UTF8_FALLBACK                     => FALL_BACK_TO_UTF8
> UTF8_FALLBACK_EXTRA_FIELDS        => FALL_BACK_TO_UTF8_PLUS_EXTRA_FIELD
> UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS => FALL_BACK_TO_UTF8_ADD_EXTRA_FIELDS
> 
> but looking at the names we may be better off with two independent
> options.  Hmm, yes, right now I prefer two flags because they seem to
> be orthogonal.

I think you should choose, which approach better fits your needs in ant ;-) At
least you have to write an XML parser for these settings and the documentation,
so you might choose the approach which may be explained in brief words.

I can live very well with two options ;-)

  Wolfgang

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-02, Wolfgang Glas <wo...@ev-i.at> wrote:

> Stefan Bodewig schrieb:
>> On 2009-03-01, Wolfgang Glas <wo...@ev-i.at> wrote:

>>> 1) Unicode extra fields are written for all ZIP entries and not only
>>> for entries, which are not encodable by the encoding set to
>>> ZipArchiveOutputStream.

>> Maybe room for yet another flag?  Or an enum-like option

>> setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)

Consider the WinZIP case, WinZIP wouldn't recognize the EFS.  If you
set the encoding to UTF-8 and use your code and only add extra fields
for non-encodable paths, WinZIP will never see the correct path.


> I like the idea of a unicode policy flag ;-)

May be a better approach, agreed.  But only if we manage to cover all
border cases.

> My suggestion is

> setUnicodePolicy(
>   SURROGATES   | /* no extra fields, no utf-8 fallback, only %Uxxxx surrogates*/
>   EXTRA_FIELDS | /* extra fields for unencodable entriey, no utf-8 fallback   */
>   EXTRA_FIELDS_ALWAYS | /* extra fields for all entries, no utf-8 fallback    */
>   UTF8_FALLBACK| /* fall back to utf-8 plus EFS flag for unencodable entries. */
>   UTF8_FALLBACK_EXTRA_FIELDS| /* fall back to utf-8 plus EFS flag plus extra
>                                  fields for unencodable */
>   UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS /* fall back to utf-8 plus EFS flag for
>                                        unencodable entries, exta fields for all
>                                        entries. */
> )

> We might drop the last two options and we might choose a better
> wording, however the direction should IMHO be as above mentioned...

This covers all permutations, agreed.

Names, names, I'm really bad at them.

EXTRA_FIELDS                      => ADD_EXTRA_FIELDS_FOR_UNENCODABLE
EXTRA_FIELDS_ALWAYS               => ADD_EXTRA_FIELDS
UTF8_FALLBACK                     => FALL_BACK_TO_UTF8
UTF8_FALLBACK_EXTRA_FIELDS        => FALL_BACK_TO_UTF8_PLUS_EXTRA_FIELD
UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS => FALL_BACK_TO_UTF8_ADD_EXTRA_FIELDS

but looking at the names we may be better off with two independent
options.  Hmm, yes, right now I prefer two flags because they seem to
be orthogonal.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Wolfgang Glas <wo...@ev-i.at>.
Stefan Bodewig schrieb:
> On 2009-03-01, Wolfgang Glas <wo...@ev-i.at> wrote:
> 
>> 1) Unicode extra fields are written for all ZIP entries and not only
>> for entries, which are not encodable by the encoding set to
>> ZipArchiveOutputStream.
> 
> Maybe room for yet another flag?  Or an enum-like option
> 
> setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)

I like the idea of a unicode policy flag ;-)

My suggestion is

setUnicodePolicy(
  SURROGATES   | /* no extra fields, no utf-8 fallback, only %Uxxxx surrogates*/
  EXTRA_FIELDS | /* extra fields for unencodable entriey, no utf-8 fallback   */
  EXTRA_FIELDS_ALWAYS | /* extra fields for all entries, no utf-8 fallback    */
  UTF8_FALLBACK| /* fall back to utf-8 plus EFS flag for unencodable entries. */
  UTF8_FALLBACK_EXTRA_FIELDS| /* fall back to utf-8 plus EFS flag plus extra
                                 fields for unencodable */
  UTF8_FALLBACK_EXTRA_FIELDS_ALWAYS /* fall back to utf-8 plus EFS flag for
                                       unencodable entries, exta fields for all
                                       entries. */
)

We might drop the last two options and we might choose a better wording, however
the direction should IMHO be as above mentioned...

  Regards,

    Wolfgang


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Re: [compress] [PATCH] Refactoring of zip encoding support.

Posted by Stefan Bodewig <bo...@apache.org>.
On 2009-03-01, Wolfgang Glas <wo...@ev-i.at> wrote:

> 1) Unicode extra fields are written for all ZIP entries and not only
> for entries, which are not encodable by the encoding set to
> ZipArchiveOutputStream.

Maybe room for yet another flag?  Or an enum-like option

setCreateUnicodeExtraFields(NEVER | ALWAYS | NOT_ENCODABLE)

?

I've commented out that part of your patch.

> To overcome these shortcoming, I had to introduce a ZipEncoding
> interface plus a java.nio implementation and a handcrafted
> implementation for Cp437 (and cp850) and refactor all the encoding
> stuff.

Looks good.

> The patch is attached. The new code is IMHO really better to read
> and make all cp437-related stuff accessible on java-1.4 as well.

Agreed, many thanks!

> Stefan, might you please review the patch and eventually apply this one?

some cosmetics and commented out the "only create Unicode field for
non-encodable paths" part - svn revision 749342.

Stefan

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org