
Posted to user@poi.apache.org by Morten <li...@kikobu.com> on 2003/11/03 14:41:27 UTC

Platform dependent encoding (problems running on linux)

Hi. I've been developing an application for extracting data
from Excel documents and inserting into a DB. When the application
runs on Windows, it works fine. Just now, I've moved to Linux,
and this appears to break the encoding. I'm curious if anyone
here has had similar experiences.

The string value below is from SSTRecord.getString(int i).
The UTF-8 byte[] is a byte-by-byte dump of the byte[] obtained
by SSTRecord.getString(int i).getBytes("UTF-8");

Log from windows:

14:32:33.213 03/11/2003 DEBUG: Processing record (31,1): Leer más
14:32:33.213 03/11/2003 DEBUG:   - deflt byte[]: 
76,101,101,114,32,109,-31,115
14:32:33.213 03/11/2003 DEBUG:   - UTF8  byte[]: 
76,101,101,114,32,109,-61,-95,115
14:32:33.213 03/11/2003 DEBUG:   - UTF16 byte[]: 
-2,-1,0,76,0,101,0,101,0,114,0,32,0,109,0,-31,0,115

Log from linux:

14:32:15.861 03/11/2003 DEBUG: Processing record (31,1): Leer mï¿½
14:32:15.861 03/11/2003 DEBUG:   - deflt byte[]: 
76,101,101,114,32,109,-17,-65,-67
14:32:15.861 03/11/2003 DEBUG:   - UTF8  byte[]: 
76,101,101,114,32,109,-17,-65,-67
14:32:15.861 03/11/2003 DEBUG:   - UTF16 byte[]: 
-2,-1,0,76,0,101,0,101,0,114,0,32,0,109,-1,-3

As you can see, the byte[]'s are different from platform to platform. :-|
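
For readers following the byte dumps, here is a minimal Java sketch (not POI code; the class name EncodingDemo is made up for illustration) of how the same stored bytes can turn into different strings depending on which charset decodes them. It assumes the 'á' reaches the decoder as the single byte 0xE1 (-31 as a signed Java byte), which is consistent with the Windows "deflt" dump above but is not confirmed by the thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingDemo {
    // "Leer más" as single-byte data; -31 is 0xE1, i.e. 'á' in ISO-8859-1 and Cp1252.
    static final byte[] RAW = {76, 101, 101, 114, 32, 109, -31, 115};

    // Decode with a single-byte Latin charset: every byte maps to exactly one char.
    static String decodeLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }

    // Decode the same bytes as UTF-8: 0xE1 announces a multi-byte sequence that
    // never arrives, so the decoder substitutes U+FFFD (the replacement character).
    static String decodeUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Re-encode both results as UTF-8 so they can be compared with the logs.
        System.out.println(Arrays.toString(decodeLatin1().getBytes(StandardCharsets.UTF_8)));
        System.out.println(Arrays.toString(decodeUtf8().getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first line printed should match the Windows UTF8 dump (..., -61, -95, 115), and the second should contain the -17, -65, -67 sequence seen in the Linux log. Whether the trailing 's' survives depends on the decoder; in the 2003 log it was swallowed along with the bad byte.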

Any tips greatly appreciated.

Thanks,

Morten




---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org

Re: Platform dependent encoding (problems running on linux)

Posted by Morten <li...@kikobu.com>.

Avik Sengupta wrote:

> Hi, 
> 
> First up, which version are you using? SSTRecord and associated have
> changed quite a bit over the last few versions, particularly with
> respect to double byte char handling. 

Ok, I'll try 2.0 RC1, was using 1.5.1.

> Also, check the default encoding in your platform/shell. That sometimes
> messes things up (it shouldn't; this is only a workaround for a bug). In
> RedHat 8, for example, the default encoding is UTF-8. Change it to
> ISO-8859-1 (LANG env) and see if it helps. Check what the encoding is on
> Windows (I don't know how... I think there is a Java system property that
> can tell you).

It's file.encoding: on Linux (RedHat 8) it's UTF-8 and on Windows it's
Cp1252. ISO-8859-1 is not good enough though; I need to be able to
work with Arabic as well.
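
A small sketch of both points, for anyone who wants to check their own setup. The class name and the Arabic sample string are illustrative only, and Charset.defaultCharset() assumes a JVM newer than the ones current when this thread was written.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultEncodingCheck {
    // True if every character of text can be represented in the given charset.
    static boolean canEncode(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The system property discussed above, plus the charset the JVM uses by default.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        // Arabic sample text, written with escapes so the source file stays ASCII.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        System.out.println("ISO-8859-1 can encode it: "
                + canEncode(StandardCharsets.ISO_8859_1, arabic));
        System.out.println("UTF-8 can encode it:      "
                + canEncode(StandardCharsets.UTF_8, arabic));
    }
}
```

ISO-8859-1 has no Arabic letters at all, so switching the platform default to it can only ever be a workaround for Western European text.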

> In summary, I suspect that your problem may be solved by upgrading, or,
> as a workaround, by setting the proper default encoding. 

I'll try upgrading for now and get back with an update - thanks for the
tips :)

Br,

Morten





Re: Platform dependent encoding (problems running on linux)

Posted by Morten <li...@kikobu.com>.

Wee. It works beautifully with 2.0 :) Thanks to both of you for 
suggestions, it's much appreciated!

Morten





Re: Platform dependent encoding (problems running on linux)

Posted by Avik Sengupta <av...@apache.org>.

Hi, 

First up, which version are you using? SSTRecord and associated have
changed quite a bit over the last few versions, particularly with
respect to double byte char handling. 

Also, check the default encoding in your platform/shell. That sometimes
messes things up (it shouldn't; this is only a workaround for a bug). In
RedHat 8, for example, the default encoding is UTF-8. Change it to
ISO-8859-1 (LANG env) and see if it helps. Check what the encoding is on
Windows (I don't know how... I think there is a Java system property that
can tell you).

In summary, I suspect that your problem may be solved by upgrading, or,
as a workaround, by setting the proper default encoding. 

HTH
-
Avik



Re: Platform dependent encoding (problems running on linux)

Posted by Avik Sengupta <av...@itellix.com>.

>POI must use the system default
Yeah, and that was a bug. Quite a lot of cleanup was done before
2.0pre2 and pre3 to ensure that it does not use the platform default.
However, this is an issue whenever strings are created from byte arrays,
so there might be some such code left over... but I think SSTRecord
should be clean now. It certainly wasn't in 1.5.1.
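
To illustrate the "strings created from byte arrays" point, here is a rough sketch of the difference between decoding with the platform default and decoding with an explicit charset. This is not POI's code: the helper names are hypothetical, and the idea that the stored text is either one byte per character or two bytes per character in little-endian order is an assumption made for the example.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitDecode {
    // Hypothetical helper: decode string bytes without ever consulting the
    // platform default charset.
    static String decode(byte[] data, boolean sixteenBit) {
        return sixteenBit
                ? new String(data, StandardCharsets.UTF_16LE)    // two bytes per char, low byte first
                : new String(data, StandardCharsets.ISO_8859_1); // one byte per char
    }

    // The pattern that breaks across platforms: the result depends on file.encoding.
    static String decodeWithPlatformDefault(byte[] data) {
        return new String(data);
    }

    public static void main(String[] args) {
        byte[] eightBit = {76, 101, 101, 114, 32, 109, -31, 115}; // "Leer más"
        byte[] sixteenBit = {0x45, 0x06, 0x31, 0x06};             // U+0645 U+0631, little-endian
        System.out.println(decode(eightBit, false));
        System.out.println(decode(sixteenBit, true));
        System.out.println(decodeWithPlatformDefault(eightBit));  // varies with the default charset
    }
}
```

The explicit variants give the same String on every platform; only the last call can differ between the Windows and Linux setups described in this thread.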

Morten, re your comment that ISO-8859-1 won't do: actually, Cp1252 is almost
the same! See http://www.kostis.net/charsets/cp1252.htm for example...
But anyway, I think your problem should go away when you upgrade.
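
For the curious, a quick sketch of where Cp1252 and ISO-8859-1 agree and where they differ. It assumes the JVM ships the "windows-1252" charset (common, but not guaranteed by the Java specification); the class name is illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252VsLatin1 {
    // Assumption: this charset is available on the running JVM.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode a single byte under the given charset and return the resulting code point.
    static int decodeByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // 0xE1 is 'á' in both charsets; 0x80 is the euro sign only in Cp1252.
        for (int b : new int[] {0xE1, 0x80}) {
            System.out.printf("0x%02X -> Cp1252 U+%04X, ISO-8859-1 U+%04X%n",
                    b, decodeByte(b, CP1252), decodeByte(b, StandardCharsets.ISO_8859_1));
        }
    }
}
```

The two charsets agree on 0xE1 (U+00E1) and differ on 0x80 (U+20AC in Cp1252, the control character U+0080 in ISO-8859-1). Neither covers Arabic.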

Regards
-
Avik

PS: As I said above, SST record should be double-byte clean, so cell
values should be fine. We've also cleaned up named ranges and string
formula results, which should also be fine for double-byte chars.
However, sheet names are doubtful even in 2.0RC1; you need to test
those.


Re: Platform dependent encoding (problems running on linux)

Posted by Ryan Ackley <sa...@cfl.rr.com>.

> I know that Java was meant to be cross-platform, but data is still
> subject to encoding by I/O and developer manipulation.

Sounds like Avik was closer than me. POI must use the system default
encoding when it creates Strings.

Ryan



Re: Platform dependent encoding (problems running on linux)

Posted by Morten <li...@kikobu.com>.

Ryan Ackley wrote:

> Not only different, but also different lengths. To me, that points to
> different files or some type of corruption when you transferred the files.

The uploaded files are identical. Both applications are web apps that run in
Tomcat 4.1.27 with identical setups.

> I don't think it's an encoding problem. The bytes are actually different. If
> you use the same encoding on two different platforms with the same bytes and
> you don't get the same results, that would be a bug in Java, not POI. You see,
> Java was invented to prevent that problem.

The application works like this: A user uploads a zip file containing 
excel documents. The application uncompresses the zip file, and then 
extracts data from the excel documents using POI and inserts it into a
DB.

If I look at the uncompressed excel files on the Linux machine, they are
fine. So in my eyes, POI 1.5.1 relies on some encoding setting that may
not be present, and decodes the excel binary data depending on that.

I know that Java was meant to be cross-platform, but data is still
subject to encoding by I/O and developer manipulation.

Br,

Morten


Re: Platform dependent encoding (problems running on linux)

Posted by Ryan Ackley <sa...@cfl.rr.com>.

Not only different, but also different lengths. To me, that points to
different files or some type of corruption when you transferred the files.

I don't think it's an encoding problem. The bytes are actually different. If
you use the same encoding on two different platforms with the same bytes and
you don't get the same results, that would be a bug in Java, not POI. You see,
Java was invented to prevent that problem.

Ryan
