Posted to user@poi.apache.org by Christian Gosch <c....@inovex.de> on 2005/11/09 09:58:06 UTC

Re: [poi] Problem with encoding

Hi,

That would be of particular interest to me, too.

We have some international names in our application, although it runs in an
ISO-Latin-1 (ISO-8859-1) [db, appserver] / Cp1252 [client] environment with
the de_DE locale by default.

We have several areas of "visibility" like DB (VarChar fields), Java source
files, appserver console, JSP source / rendering / display, PDF and XLS
download.

We currently use the latest final POI release (2.5.1, I believe), and I do not
remember any way of setting the encoding for String values in a sheet. Since
the XLS file format is a kind of "hybrid", mixing binary structure / control
data with textual content data, it is crucial to write all textual "content"
in the appropriate encoding -- and that encoding should be configurable.

Testing some examples, I found that
- almost all characters found in our data are displayed as they should be, in
JSP and XLS (by POI).
- the Czech "s with v on top" (s with caron) is displayed well in JSPs, but
not in POI-generated XLS: there it shows up as a "little rectangle".
I know that in ISO-8859-1 there are also problems with the Danish "o with
slash", but currently I have no test data. I would also expect problems with
Turkish letters like "i without dot" or "c with bottom accent" (c with
cedilla), as in the city name "Incirlik" when written correctly.
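
One way to check such cases without test data is to ask the JVM directly
whether a charset has a representation for a given character. The following
is a minimal sketch in plain Java (it does not touch POI), assuming the
charset name windows-1252 is known to the JVM in use; the class and method
names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {

    /** Returns true if the given charset has a byte representation for the character. */
    static boolean canEncode(Charset charset, char c) {
        // A fresh encoder per call keeps the sketch simple.
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumption: this charset name is available on the JVM in use.
        Charset cp1252 = Charset.forName("windows-1252");

        // s with caron, o with slash, dotless i, c with cedilla
        char[] samples = { '\u0161', '\u00F8', '\u0131', '\u00E7' };
        for (char c : samples) {
            System.out.printf("U+%04X  ISO-8859-1: %-5b  windows-1252: %b%n",
                    (int) c, canEncode(latin1, c), canEncode(cp1252, c));
        }
    }
}
```

Running it prints one line per sample character, which makes it easy to see
which of the two charsets, if any, is the limiting factor for a given name.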

btw:
In JXL (JExcelAPI) it is possible to set up an encoding for a generated XLS
file, which by default is "the default encoding of the hosting VM", but it
took a while to make that happen.


Regards
Christian Gosch
inovex GmbH



On Tuesday, November 08, 2005 11:59 PM [GMT+1=CET],
Olivier Matt <in...@kodee.org> wrote:

> Hello,
>
> I'm reading Excel files and I get a String from a CELL_TYPE_STRING
> cell.
>
> That string has some problems with accents (I guess the file is
> encoded using some Latin character encoding): they are not displayed
> properly.
>
> How can I avoid this behavior? Can I specify the encoding of the cells
> somewhere?
> Or is there a method for transforming misinterpreted strings into
> proper Latin strings?
>
>
> Thanks for help,
>
> Olivier
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
> Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
> The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
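
Regarding the last question above: if the accents are wrong because bytes
were decoded with the wrong charset somewhere along the way (an assumption;
nothing in this thread establishes the cause), plain Java can sometimes
reverse that step. A minimal sketch, with hypothetical class and method
names, and with ISO-8859-1 / UTF-8 chosen purely as an example pair:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Redecode {

    /**
     * Undoes a wrong decoding step: turns the string back into bytes using
     * the charset that was wrongly assumed, then decodes those bytes again
     * with the charset that was actually used to write them.
     */
    static String redecode(String garbled, Charset wronglyAssumed, Charset actual) {
        return new String(garbled.getBytes(wronglyAssumed), actual);
    }

    public static void main(String[] args) {
        // Simulate the problem: the UTF-8 bytes of "e with acute" read as ISO-8859-1.
        String original = "\u00E9";
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        String repaired = redecode(garbled, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // prints "true"
    }
}
```

This only works when the wrongly assumed charset maps every byte value to a
distinct character, as ISO-8859-1 does; once a character has been replaced by
"?" or a rectangle, the original bytes are gone.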

-- 
Dipl.-Inform. Christian Gosch
Systems Development
inovex GmbH
mailto:c.gosch@inovex.de
http://www.inovex.de


Re: [poi] Problem with encoding

Posted by ac...@apache.org.
Christian Gosch wrote:
> However, there are other Windows localizations around the world which don't
> use Cp1252 -- to go to the extremes, look at Asian versions supporting
> Traditional Chinese, Japanese or the like. Even Russian, Korean and Hebrew
> versions are sold, and they all have completely different charsets -- and the
> interesting thing about it is that I can view / edit files created there on
> other localizations! So there *must* be *some* way to (a) encode it and (b)
> tell what the actual encoding is.
>

You're actually out of luck with any right-to-left language at the moment
unless someone takes the time to decode the "special", "undocumented" bits
necessary to handle them. There are also special "far east" bits in the
file format that we preserve but have not reverse engineered. The OO.o
guys also seem not to have done this yet (these bits were never documented
anywhere, and you'd need to know Chinese or Japanese to really do this work).

> Look at the implementation of JXL. It is not my work, but we used it a lot
> before POI. In the first versions, there was no idea of supporting something
> like non-US chars, but after some weeks of discussion the developer of JXL
> got the message, and *did* implement support of different encodings. Since
> this product is open source, it should be possible. Look for JExcelAPI
> (http://www.andykhan.com/jexcelapi/).
> 

We should not and cannot look at it. Its license means we might be
producing a derivative work and encumbering POI. (They are not bound by
such a problem, because the ASL is so permissive.) Fortunately,
we're pretty good at figuring things out ourselves.

> Please don't let all of us non-US developers in 'good old europe' and
> wherever else starve for lack of custom encoding support...
> 

Generally speaking, insulting people tends to make them stop helping you.
POI was first widely used in Germany, thanks to its coverage in the
German tech media, long before it was used in the US.
There are only three committers from the US on the project. Most of the
committers are from other parts of the world.

Tschüs,

-Andy

> Regards
> Christian Gosch
> inovex GmbH
> 
> 
> On Wednesday, November 09, 2005 11:07 PM [GMT+1=CET],
> acoliver@apache.org <ac...@apache.org> wrote:
> 
> 
>>Excel wants cp1252 for most things...  It just does.  When I get home
>>(I'm on the road) I'll look at the dev kit...it may be that by
>>changing the codepage record we can handle things a bit nicer, but
>>eeez kinda picky about that and regardless of what AIX may support,
>>when
>>you open the Excel sheet it will be on Windows generally (or a
>>semi-emulation of it on Mac/Linux) and you'll have to write it in an
>>encoding supported by Excel for Windows...
>>
>>-Andy
>>
>>Rainer Klute wrote:
>>
>>
>>>On Wednesday, November 09, 2005, 07:25 -0500, acoliver@apache.org wrote:
>>>
>>>
>>>
>>>>We should be universally handling the issues mentioned here:
>>>>http://en.wikipedia.org/wiki/Windows-1252 by intercepting character
>>>>differences and writing them out properly.  Thus HSSF should force
>>>>8859-1 encoding but should then kind of do a replace on the
>>>> characters. If someone wants to contribute I can point them in the
>>>>right direction.
>>>>
>>>>
>>>
>>>Um, no. Enforcing ISO 8859-1 as the character encoding would be of
>>>limited use only. The reason is that, like Windows codepage 1252, it
>>>represents only a limited set of characters. UTF-8 is the preferred
>>>character encoding. However, POI should not forbid creating strings
>>>in other character encodings, be it ISO 8859-1, cp1252 or whatever.
>>>
>>>By the way, HPSF does a nice job of supporting a lot of different
>>>character encodings. At least there are no problems I am aware of. I
>>>suggest you have a look at it.
>>>
>>>Best regards
>>>Rainer Klute
>>>
>>>                          Rainer Klute IT-Consulting GmbH
>>> Dipl.-Inform.
>>> Rainer Klute             E-Mail:  klute@rainer-klute.de
>>> Körner Grund 24          Telefon: +49 172 2324824
>>>D-44143 Dortmund           Telefax: +49 231 5349423
>>>
>>>Public key fingerprint: E4E4386515EE0BED5C162FBB5343461584B5A42E
> 
> 
> Gruesse,


Re: [poi] Problem with encoding

Posted by Christian Gosch <c....@inovex.de>.
However, there are other Windows localizations around the world which don't
use Cp1252 -- to go to the extremes, look at Asian versions supporting
Traditional Chinese, Japanese or the like. Even Russian, Korean and Hebrew
versions are sold, and they all have completely different charsets -- and the
interesting thing about it is that I can view / edit files created there on
other localizations! So there *must* be *some* way to (a) encode it and (b)
tell what the actual encoding is.
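
To illustrate point (b): the same byte value stands for different characters
depending on the Windows code page used to read it, so a reader has to be
told the encoding or has to guess it. A minimal sketch in plain Java (no POI
involved), assuming the JVM in use knows the charset names windows-1250,
windows-1251 and windows-1252; the class and method names are made up for
illustration:

```java
import java.nio.charset.Charset;

public class SameByteDifferentText {

    /** Decodes the bytes with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        // Assumption: the charset name is available on the JVM in use.
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xE6 }; // one and the same byte

        String[] codePages = { "windows-1250", "windows-1251", "windows-1252" };
        for (String name : codePages) {
            String text = decode(data, name);
            // Print the code point rather than the character itself, so the
            // output does not depend on the console encoding.
            System.out.printf("%s -> U+%04X%n", name, (int) text.charAt(0));
        }
    }
}
```

Each code page yields a different code point for the byte, which is why a
file that stores 8-bit text needs some record of the code page it was
written with.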

Look at the implementation of JXL. It is not my work, but we used it a lot
before POI. In the first versions, there was no idea of supporting something
like non-US chars, but after some weeks of discussion the developer of JXL
got the message, and *did* implement support of different encodings. Since
this product is open source, it should be possible. Look for JExcelAPI
(http://www.andykhan.com/jexcelapi/).

Please don't let all of us non-US developers in 'good old europe' and
wherever else starve for lack of custom encoding support...

Regards
Christian Gosch
inovex GmbH


On Wednesday, November 09, 2005 11:07 PM [GMT+1=CET],
acoliver@apache.org <ac...@apache.org> wrote:

> Excel wants cp1252 for most things...  It just does.  When I get home
> (I'm on the road) I'll look at the dev kit...it may be that by
> changing the codepage record we can handle things a bit nicer, but
> eeez kinda picky about that and regardless of what AIX may support,
> when
> you open the Excel sheet it will be on Windows generally (or a
> semi-emulation of it on Mac/Linux) and you'll have to write it in an
> encoding supported by Excel for Windows...
>
> -Andy
>
> Rainer Klute wrote:
>
>> On Wednesday, November 09, 2005, 07:25 -0500, acoliver@apache.org wrote:
>>
>>
>>> We should be universally handling the issues mentioned here:
>>> http://en.wikipedia.org/wiki/Windows-1252 by intercepting character
>>> differences and writing them out properly.  Thus HSSF should force
>>> 8859-1 encoding but should then kind of do a replace on the
>>>  characters. If someone wants to contribute I can point them in the
>>> right direction.
>>>
>>>
>>
>> Um, no. Enforcing ISO 8859-1 as the character encoding would be of
>> limited use only. The reason is that, like Windows codepage 1252, it
>> represents only a limited set of characters. UTF-8 is the preferred
>> character encoding. However, POI should not forbid creating strings
>> in other character encodings, be it ISO 8859-1, cp1252 or whatever.
>>
>> By the way, HPSF does a nice job of supporting a lot of different
>> character encodings. At least there are no problems I am aware of. I
>> suggest you have a look at it.
>>
>> Best regards
>> Rainer Klute
>>
>>                           Rainer Klute IT-Consulting GmbH
>>  Dipl.-Inform.
>>  Rainer Klute             E-Mail:  klute@rainer-klute.de
>>  Körner Grund 24          Telefon: +49 172 2324824
>> D-44143 Dortmund           Telefax: +49 231 5349423
>>
>> Public key fingerprint: E4E4386515EE0BED5C162FBB5343461584B5A42E

Gruesse,
-- 
Dipl.-Inform. Christian Gosch
Systems Development
inovex GmbH
Karlsruher Strasse 71
D-75179 Pforzheim
Tel.: +49 (0)72 31 - 31 91 - 85
Fax: +49 (0)72 31 - 31 91 - 91
mailto:c.gosch@inovex.de
http://www.inovex.de


Re: [poi] Problem with encoding

Posted by ac...@apache.org.
Excel wants cp1252 for most things...  It just does.  When I get home 
(I'm on the road) I'll look at the dev kit...it may be that by changing the
codepage record we can handle things a bit nicer, but eeez kinda picky 
about that and regardless of what AIX may support, when
you open the Excel sheet it will be on Windows generally (or a 
semi-emulation of it on Mac/Linux) and you'll have to write it in an
encoding supported by Excel for Windows...

-Andy

Rainer Klute wrote:

>On Wednesday, November 09, 2005, 07:25 -0500, acoliver@apache.org wrote:
>  
>
>>We should be universally handling the issues mentioned here: 
>>http://en.wikipedia.org/wiki/Windows-1252 by intercepting character 
>>differences and writing them out properly.  Thus HSSF should force 
>>8859-1 encoding but should then kind of do a replace on the characters. 
>>  If someone wants to contribute I can point them in the right direction.
>>    
>>
>
>Um, no. Enforcing ISO 8859-1 as the character encoding would be of limited
>use only. The reason is that, like Windows codepage 1252, it represents only
>a limited set of characters. UTF-8 is the preferred character encoding.
>However, POI should not forbid creating strings in other character
>encodings, be it ISO 8859-1, cp1252 or whatever.
>
>By the way, HPSF does a nice job of supporting a lot of different
>character encodings. At least there are no problems I am aware of. I
>suggest you have a look at it.
>
>Best regards
>Rainer Klute
>
>                           Rainer Klute IT-Consulting GmbH
>  Dipl.-Inform.
>  Rainer Klute             E-Mail:  klute@rainer-klute.de
>  Körner Grund 24          Telefon: +49 172 2324824
>D-44143 Dortmund           Telefax: +49 231 5349423
>
>Public key fingerprint: E4E4386515EE0BED5C162FBB5343461584B5A42E
>  
>


-- 
Andrew C. Oliver
SuperLink Software, Inc.

Java to Excel using POI
http://www.superlinksoftware.com/services/poi
Commercial support including features added/implemented, bugs fixed.



Re: [poi] Problem with encoding

Posted by Rainer Klute <kl...@rainer-klute.de>.
On Wednesday, November 09, 2005, 07:25 -0500, acoliver@apache.org wrote:
> We should be universally handling the issues mentioned here: 
> http://en.wikipedia.org/wiki/Windows-1252 by intercepting character 
> differences and writing them out properly.  Thus HSSF should force 
> 8859-1 encoding but should then kind of do a replace on the characters. 
>   If someone wants to contribute I can point them in the right direction.

Um, no. Enforcing ISO 8859-1 as the character encoding would be of limited
use only. The reason is that, like Windows codepage 1252, it represents only
a limited set of characters. UTF-8 is the preferred character encoding.
However, POI should not forbid creating strings in other character
encodings, be it ISO 8859-1, cp1252 or whatever.
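
The practical effect of a "limited set of characters" is easy to demonstrate
in plain Java: encoding a string with a charset that lacks one of its
characters silently replaces that character. A minimal sketch, independent
of POI, with hypothetical class and method names:

```java
import java.nio.charset.StandardCharsets;

public class LossyEncoding {

    /** Encodes the text with ISO-8859-1 and decodes it again. */
    static String viaLatin1(String text) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // bytes (a question mark for ISO-8859-1) for unmappable characters.
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes the text with UTF-8 and decodes it again. */
    static String viaUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "o with diaeresis" is in ISO-8859-1, "s with caron" is not.
        String text = "K\u00F6rner \u0161";
        System.out.println(viaLatin1(text).equals(text)); // prints "false"
        System.out.println(viaUtf8(text).equals(text));   // prints "true"
    }
}
```

The ISO-8859-1 round trip turns the s with caron into "?", while the UTF-8
round trip returns the string unchanged.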

By the way, HPSF does a nice job of supporting a lot of different
character encodings. At least there are no problems I am aware of. I
suggest you have a look at it.

Best regards
Rainer Klute

                           Rainer Klute IT-Consulting GmbH
  Dipl.-Inform.
  Rainer Klute             E-Mail:  klute@rainer-klute.de
  Körner Grund 24          Telefon: +49 172 2324824
D-44143 Dortmund           Telefax: +49 231 5349423

Public key fingerprint: E4E4386515EE0BED5C162FBB5343461584B5A42E

Re: [poi] Problem with encoding

Posted by ac...@apache.org.
We should be universally handling the issues mentioned here: 
http://en.wikipedia.org/wiki/Windows-1252 by intercepting character 
differences and writing them out properly.  Thus HSSF should force 
8859-1 encoding but should then kind of do a replace on the characters. 
  If someone wants to contribute I can point them in the right direction.
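
The character differences in question are confined to a small range of byte
values, which can be seen by decoding every byte value with both charsets
and comparing the results. A minimal sketch in plain Java (this is not POI
code), assuming the charset name windows-1252 is available on the JVM in
use; the class and method names are hypothetical:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class Cp1252Differences {

    /**
     * Maps each byte value that decodes differently in ISO-8859-1 and
     * windows-1252 to the character windows-1252 yields for it.
     */
    static Map<Integer, Character> differences() {
        // Assumption: this charset name is available on the JVM in use.
        Charset cp1252 = Charset.forName("windows-1252");
        Map<Integer, Character> result = new TreeMap<>();
        for (int b = 0; b < 256; b++) {
            byte[] single = { (byte) b };
            String asLatin1 = new String(single, StandardCharsets.ISO_8859_1);
            String asCp1252 = new String(single, cp1252);
            if (!asLatin1.equals(asCp1252)) {
                result.put(b, asCp1252.charAt(0));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Character> e : differences().entrySet()) {
            System.out.printf("0x%02X -> U+%04X%n",
                    e.getKey(), (int) e.getValue().charValue());
        }
    }
}
```

A replacement step of the kind described above would only need to treat the
byte values reported by this map specially; all other bytes mean the same
thing in both charsets.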

-andy

Christian Gosch wrote:
> Hi,
> 
> That would be of particular interest to me, too.
> 
> We have some international names in our application, although it runs in an
> ISO-Latin-1 (ISO-8859-1) [db, appserver] / Cp1252 [client] environment with
> the de_DE locale by default.
> 
> We have several areas of "visibility" like DB (VarChar fields), Java source
> files, appserver console, JSP source / rendering / display, PDF and XLS
> download.
> 
> We currently use the latest final POI release (2.5.1, I believe), and I do not
> remember any way of setting the encoding for String values in a sheet. Since
> the XLS file format is a kind of "hybrid", mixing binary structure / control
> data with textual content data, it is crucial to write all textual "content"
> in the appropriate encoding -- and that encoding should be configurable.
> 
> Testing some examples, I found that
> - almost all characters found in our data are displayed as they should be, in
> JSP and XLS (by POI).
> - the Czech "s with v on top" (s with caron) is displayed well in JSPs, but
> not in POI-generated XLS: there it shows up as a "little rectangle".
> I know that in ISO-8859-1 there are also problems with the Danish "o with
> slash", but currently I have no test data. I would also expect problems with
> Turkish letters like "i without dot" or "c with bottom accent" (c with
> cedilla), as in the city name "Incirlik" when written correctly.
> 
> btw:
> In JXL (JExcelAPI) it is possible to set up an encoding for a generated XLS
> file, which by default is "the default encoding of the hosting VM", but it
> took a while to make that happen.
> 
> 
> Regards
> Christian Gosch
> inovex GmbH
> 
> 
> 
> On Tuesday, November 08, 2005 11:59 PM [GMT+1=CET],
> Olivier Matt <in...@kodee.org> wrote:
> 
> 
>>Hello,
>>
>>I'm reading Excel files and I get a String from a CELL_TYPE_STRING
>>cell.
>>
>>That string has some problems with accents (I guess the file is
>>encoded using some Latin character encoding): they are not displayed
>>properly.
>>
>>How can I avoid this behavior? Can I specify the encoding of the cells
>>somewhere?
>>Or is there a method for transforming misinterpreted strings into
>>proper Latin strings?
>>
>>
>>Thanks for help,
>>
>>Olivier
>>
> 
> 


-- 
Andrew C. Oliver
SuperLink Software, Inc.

Java to Excel using POI
http://www.superlinksoftware.com/services/poi
Commercial support including features added/implemented, bugs fixed.


---------------------------------------------------------------------
To unsubscribe, e-mail: poi-dev-unsubscribe@jakarta.apache.org
Mailing List:    http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta POI Project: http://jakarta.apache.org/poi/

