Posted to dev@commons.apache.org by Stefan Bodewig <bo...@apache.org> on 2012/03/18 18:37:06 UTC

[compress] encoding in tar package

Hi all,

I've been working on COMPRESS-183, a more general version of
COMPRESS-114, which we fixed a while ago.  It asks for support of
non-ASCII file names in tar archives via an explicit encoding.
(COMPRESS-114 only made things work for ISO-8859-1 and any other
encoding that creates the same bytes for chars 0 to 255.)
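
To make the "same bytes for chars 0 to 255" remark concrete, here is a
minimal sketch using only the JDK.  The class and method names are made
up for illustration and are not part of the tar package.  It checks that
ISO-8859-1 encodes each of the first 256 chars as the single byte with
the same value, and shows that UTF-8 does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not Commons Compress code.
class Latin1Identity {

    /** True if every char in 0..255 encodes to one byte of the same value. */
    static boolean isIdentityForFirst256() {
        for (int c = 0; c < 256; c++) {
            byte[] encoded =
                String.valueOf((char) c).getBytes(StandardCharsets.ISO_8859_1);
            if (encoded.length != 1 || (encoded[0] & 0xFF) != c) {
                return false;
            }
        }
        return true;
    }

    /** Number of bytes UTF-8 needs for the given text. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(isIdentityForFirst256()); // prints "true"
        // U+00E9 (e with acute accent) is one byte in ISO-8859-1
        // but two bytes in UTF-8.
        System.out.println(utf8Length("\u00e9"));    // prints "2"
    }
}
```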

tar itself doesn't support anything but ASCII, and only the later POSIX
versions added support for UTF-8, via PAX extension headers (something I
intend to add).  Most tar dialects will use the platform's default
encoding for non-ASCII names.

I have checked in some initial infrastructure that reuses the zip
package's encoding classes and already allows reading names in any
encoding; adding write support will be trivial.  The patch is more
convoluted than I had hoped because the tar package has way too many
public methods, and I had to work around backwards-compatibility issues.
That includes swallowing exceptions that may occur if the specified
encoding doesn't work for the name/bytes.  This is something to address
in compress 2.x (which I hope to kick off after releasing 1.4).
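
As a rough illustration of the "swallowing exceptions" trade-off, the
sketch below uses the plain JDK charset API, not the actual patch.  The
class and method names are hypothetical.  A strict encoder reports
characters the chosen encoding cannot represent, while the lenient
wrapper hides the failure and falls back to the JDK's replacement
behaviour, which is roughly what a method that may not declare a new
checked exception is forced to do.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not Commons Compress code.
class LenientNameEncoder {

    /** Encodes the name, failing if a char cannot be represented. */
    static byte[] encodeStrict(String name, Charset cs)
            throws CharacterCodingException {
        CharsetEncoder enc = cs.newEncoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bb = enc.encode(CharBuffer.wrap(name));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    /** Same, but swallows the failure to keep an exception-free signature. */
    static byte[] encodeLenient(String name, Charset cs) {
        try {
            return encodeStrict(name, cs);
        } catch (CharacterCodingException e) {
            // Swallowed: String.getBytes substitutes the charset's
            // replacement bytes for anything it cannot encode.
            return name.getBytes(cs);
        }
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) has no US-ASCII representation.
        byte[] bytes = encodeLenient("a\u20ac", StandardCharsets.US_ASCII);
        System.out.println(bytes.length);
    }
}
```

The caller never learns that the name was altered, which is why this is
better fixed with an API change than papered over.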

Anyway, the current code changes one thing: it now defaults to the
platform's default encoding, while the 1.3 version specifically supports
ISO-8859-1 (and nothing else).  Anybody who relied on ISO-8859-1 being
the default will have to change their code to ask for it explicitly.  Is
this acceptable, or do I need to change the default?
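
To show why the default matters, here is a small sketch with the plain
JDK.  The names are hypothetical and this is not the tar package's API.
The same name bytes decode to different strings depending on the charset,
so an archive written with ISO-8859-1 names only reads back correctly on
a UTF-8 platform if the caller names the encoding explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not Commons Compress code.
class DefaultEncodingDemo {

    /** Decodes raw header bytes into a file name using the given charset. */
    static String decodeName(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        String original = "na\u00efve.txt"; // contains U+00EF, i with diaeresis
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);

        // Explicit encoding: round-trips regardless of the platform.
        System.out.println(
            original.equals(decodeName(raw, StandardCharsets.ISO_8859_1)));

        // What a platform whose default is UTF-8 would produce instead.
        System.out.println(
            original.equals(decodeName(raw, StandardCharsets.UTF_8)));

        // The platform default itself varies from machine to machine.
        System.out.println(Charset.defaultCharset());
    }
}
```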

Stefan
