You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@poi.apache.org by Carey Sublette <ca...@overture.com> on 2003/01/30 17:04:16 UTC

Bad Data Issue in HSSF: Fix Proposed

This is an admittedly uncommon issue, but I encounter a problem from time to
time involving bad character values in xls files.

According to the Excel97 format, text data is supposed to be in Unicode.
There are 32 values in ISO-8859-1 (Latin-1) that are invalid, from 0x80 to
0x9F, and of course the same is true for the superset of Latin-1 which is
Unicode.

The Windows platform though has long used an 8 bit encoding called
windows-1252 (the IANA official designation, also called codepage-1252),
which extended Latin-1 by assigning 27 values in this range to symbols that
have a 16 bit representation in Unicode (see
http://www.microsoft.com/globaldev/reference/sbcs/1252.htm for the spec).
See also http://www.iana.org/assignments/character-sets and
http://www.iana.org/assignments/charset-reg/windows-1252 for reference.

Apparently some versions of Microsoft software (software and OS versions
unknown) make it possible to generate Excel97 files containing codepage-1252
character values. This causes problems when the data is passed on to other
processes. At the very least, the intended symbol is lost (replaced by a
"?") but it can also pass the invalid values on (depending of whether the
text string contained only 8 bit encodings or not) to cause mischief
elsewhere.

I encounter this in an environment where we import xls files submitted by
customers through the internet, which leads to a very mixed bag of files
showing up.

I propose a patch to:
org.apache.poi.hssf.record.UnicodeString.fillFields 
where the characters are actually copied from the byte buffer to
field_3_string.

In the case of 8 bit strings (the grbit is 0) the patch would be to use the
encoding "Cp1252" instead of  "ISO-8859-1" as the default. Since as defined
windows-1252 is a proper superset of Latin-1 this should always work. Pure
Latin-1 text will be properly translated, and so will windows-1252 (the 27
symbols not found in Latin-1 will become Unicode values with a non-zero high
byte). 

Alternatively, the 8 bit string could be scanned for bytes in the range
0x80-0x9F and if found the String constructor would be passed the encoding
"Cp1252" instead of "ISO-8859-1". The only reason for doing it this way is
would be if one theorized that deviant implementations on some version of
Java existed, and sought to reduce exposure to this.

For the 16 bit strings the patch would be to filter the characters as the
copy it done to explicitly translate the 27 characters if found.

There shouldn't be any issues going the other way, that is, writing Excel97
documents. Since Unicode is the documented standard (according to pg. 264 of
the Excel97 Developer's Kit) there would never be a need to generate
windows-1252 output.

I am implementing my recommended patch right now for my own use, and will
submit it for incorporation into Poi if you the team is interested.

Carey Sublette

Re: Bad Data Issue in HSSF: Fix Proposed

Posted by ac...@apache.org.

It sounds like you know what you're talking about.  Do it and submit
assocaited unit tests.  Please make sure you test with some non-english test
files (ask for some german, spanish, japanese, russian, korean and chinese
as you'll get the most contributions as we have the largest user bases of
those particular languages).

Pay attention to the "get involved" page so that your patch can be easily
applied.  (our principal method of selecting patches is by picking thte ones
which are submitted as we request, if the person took due care, then its a
reasonable expectation that the code will reflect this).

-Andy

-Andy
----- Original Message -----
From: "Carey Sublette" <ca...@overture.com>
To: <po...@jakarta.apache.org>
Cc: "Johnson Marc (E-mail)" <ma...@hotmail.com>
Sent: Thursday, January 30, 2003 11:04 AM
Subject: Bad Data Issue in HSSF: Fix Proposed


> This is an admittedly uncommon issue, but I encounter a problem from time
to
> time involving bad character values in xls files.
>
> According to the Excel97 format, text data is supposed to be in Unicode.
> There are 32 values in ISO-8859-1 (Latin-1) that are invalid, from 0x80 to
> 0x9F, and of course the same is true for the superset of Latin-1 which is
> Unicode.
>
> The Windows platform though has long used an 8 bit encoding called
> windows-1252 (the IANA official designation, also called codepage-1252),
> which extended Latin-1 by assigning 27 values in this range to symbols
that
> have a 16 bit representation in Unicode (see
> http://www.microsoft.com/globaldev/reference/sbcs/1252.htm for the spec).
> See also http://www.iana.org/assignments/character-sets and
> http://www.iana.org/assignments/charset-reg/windows-1252 for reference.
>
> Apparently some versions of Microsoft software (software and OS versions
> unknown) make it possible to generate Excel97 files containing
codepage-1252
> character values. This causes problems when the data is passed on to other
> processes. At the very least, the intended symbol is lost (replaced by a
> "?") but it can also pass the invalid values on (depending of whether the
> text string contained only 8 bit encodings or not) to cause mischief
> elsewhere.
>
> I encounter this in an environment where we import xls files submitted by
> customers through the internet, which leads to a very mixed bag of files
> showing up.
>
> I propose a patch to:
> org.apache.poi.hssf.record.UnicodeString.fillFields
> where the characters are actually copied from the byte buffer to
> field_3_string.
>
> In the case of 8 bit strings (the grbit is 0) the patch would be to use
the
> encoding "Cp1252" instead of  "ISO-8859-1" as the default. Since as
defined
> windows-1252 is a proper superset of Latin-1 this should always work. Pure
> Latin-1 text will be properly translated, and so will windows-1252 (the 27
> symbols not found in Latin-1 will become Unicode values with a non-zero
high
> byte).
>
> Alternatively, the 8 bit string could be scanned for bytes in the range
> 0x80-0x9F and if found the String constructor would be passed the encoding
> "Cp1252" instead of "ISO-8859-1". The only reason for doing it this way is
> would be if one theorized that deviant implementations on some version of
> Java existed, and sought to reduce exposure to this.
>
> For the 16 bit strings the patch would be to filter the characters as the
> copy it done to explicitly translate the 27 characters if found.
>
> There shouldn't be any issues going the other way, that is, writing
Excel97
> documents. Since Unicode is the documented standard (according to pg. 264
of
> the Excel97 Developer's Kit) there would never be a need to generate
> windows-1252 output.
>
> I am implementing my recommended patch right now for my own use, and will
> submit it for incorporation into Poi if you the team is interested.
>
> Carey Sublette
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: poi-dev-help@jakarta.apache.org
>