Posted to user@poi.apache.org by Steven Citron-Pousty <St...@yale.edu> on 2004/07/30 23:17:33 UTC

Problem with encoding - not typical

Greetings o-helpful-poiers:
Alright, so here goes: I am having an encoding issue, and not a
straightforward one.
I am working on Win XP Pro, English (U.S.), with no language packs
installed and with all the service packs and fixes. I am using JDK 1.4.2_03.

I have this text in my Excel file:

EMBARCACIONES INSCRITAS EN EL REGISTRO NACIONAL DE PESCA DEL SECTOR 
SOCIAL EN PESCA RIBEREÑA SEGÚN PESQUERÍA Al 31 de diciembre de 1993

I get it out of the spreadsheet using POI and then want to print it in 
UTF-8, so I call this function:

 private String convertToUTF8(String incoming){
        String outgoing = null;
        try {
           if(incoming != null){
                byte[] incomingBytes = incoming.getBytes("UTF-8");
                outgoing = new String(incomingBytes);
           } else {
               outgoing = "";
           }

        } catch (UnsupportedEncodingException e) {
            logger.error(" threw an exception trying to write utf8:"
                    + e.getMessage());
        }
        return outgoing;
    }

And what I get when I put it in XML is this:

<titl> CUADRO 4.1.4.4 - EMBARCACIONES INSCRITAS EN EL REGISTRO NACIONAL DE PESCA DEL SECTOR SOCIAL EN PESCA RIBEREÑA SEGÚN PESQUERÿA Al 31 de diciembre de 1993</titl>

Notice what happened to the Í: it got turned into a ÿ.


I checked the hex value in the Excel file at that location and it is 
CD, which is the right hex value for Í.

So I don't understand why some characters are being converted and just 
that one character is not. And it's consistent among spreadsheets that 
the Í is not converted while other accented characters are.
I tried using some of the code from this page
http://www.jguru.com/faq/view.jsp?EID=137049
but then none of the characters are rendered correctly.

Here is how I am getting the cell's contents with POI:

private String cellToString(HSSFCell cell){
        String value = null;

        if (cell == null ) {
            value = "";
        } else if (cell.getCellType() == HSSFCell.CELL_TYPE_STRING){
            value = cell.getStringCellValue();
        } else if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC){
            value = Double.toString(cell.getNumericCellValue());
        } else if (cell.getCellType() == HSSFCell.CELL_TYPE_BLANK) {
            value = "";
        } else if (cell.getCellType() == HSSFCell.CELL_TYPE_BOOLEAN){
            boolean bvalue = cell.getBooleanCellValue();
            value = Boolean.toString(bvalue);
        } else if (cell.getCellType() == HSSFCell.CELL_TYPE_FORMULA){
            value = cell.getStringCellValue();
        } else {
            value = "";
        }

        return value;
    }


Any help or suggestions would be greatly appreciated since I am not sure 
about how to dig into the POI library to see what is happening there.
Thanks,
Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: poi-user-help@jakarta.apache.org


RE: Problem with encoding - not typical

Posted by Michael Zalewski <za...@optonline.net>.
The problem is that the string is encoded in CP 1252 (for US Windows), but
your code is interpreting the byte array in UTF-8.
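
To make the mismatch concrete, here is a minimal sketch that mimics the
posted convertToUTF8. Read literally, that function encodes the String as
UTF-8 and then decodes those bytes with the platform default charset. The
sketch assumes that default is windows-1252 (the usual one on US Windows)
and uses java.nio.charset.StandardCharsets, which needs Java 7 or later; on
JDK 1.4.2 the charset names would be passed as strings instead. The class
and method names are made up for illustration. In UTF-8, Ñ is C3 91, Ú is
C3 9A and Í is C3 8D, and 0x8D is one of the few bytes with no character
assigned in windows-1252, which would make Í the odd one out. This does not
by itself explain why a ÿ appears.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Assumption: the platform default charset on the poster's machine.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Mimics the posted function: encode as UTF-8, decode as CP 1252. */
    static String encodeThenMisdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, CP1252);
    }

    /**
     * True if writing the garbled String back out as CP 1252 happens to
     * reproduce the original UTF-8 bytes, i.e. the damage cancels out.
     */
    static boolean survives(String s) {
        byte[] written = encodeThenMisdecode(s).getBytes(CP1252);
        return Arrays.equals(written, s.getBytes(StandardCharsets.UTF_8));
    }

    /** Formats bytes as upper-case hex pairs separated by spaces. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // N-tilde, U-acute, I-acute, written as escapes so the source
        // file's own encoding does not matter.
        for (String s : new String[] {"\u00D1", "\u00DA", "\u00CD"}) {
            System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))
                    + " survives=" + survives(s));
        }
    }
}
```

Under these assumptions the accidental round trip only works for characters
whose UTF-8 bytes all have a CP 1252 mapping, which would account for some
accented characters coming through while Í does not.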

This won't work for any character in the range 0x80 - 0x9f (which is why
there are problems with the Euro symbol).

This also won't work for any character in the range 0xc0 - 0xff (because
in UTF-8 these bytes introduce multi-byte sequences).

I don't know why you get y umlaut. But I can tell you this much:

1) The source characters ÍA get represented in CP 1252 as 0xcd 0x41. That
   is an invalid sequence in UTF-8: the 0xcd should begin a two-byte
   encoding, and the second byte should have its high-order bit set.
2) A lower case y umlaut is 0xff in CP 1252.
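
The two points above can be checked with a small sketch. It uses a strict
UTF-8 decoder that reports malformed input instead of silently replacing
it, and decodes the same bytes as windows-1252 for comparison. The class
and method names are hypothetical, and the java.nio.charset.StandardCharsets
constant assumes Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    /** Returns true only if the bytes form well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Decodes the bytes as CP 1252 (windows-1252). */
    static String decodeCp1252(byte[] bytes) {
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        byte[] cp1252Bytes = {(byte) 0xCD, 0x41};     // I-acute, A in CP 1252
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0x8D}; // I-acute in UTF-8
        System.out.println(isValidUtf8(cp1252Bytes));
        System.out.println(isValidUtf8(utf8Bytes));
        System.out.println(decodeCp1252(cp1252Bytes));
    }
}
```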

You can look at http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html for
a better explanation.
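
As a possible way forward, here is a minimal sketch of writing the value
out as UTF-8 without any conversion function. It rests on two assumptions:
that the String returned by getStringCellValue() already holds the right
characters (Java Strings are Unicode internally), and that the encoding
only needs to be chosen once, on the Writer that produces the XML. The
class, the escape helper and the titleAsUtf8Xml method are hypothetical
names; only the <titl> element comes from the original post. The
try-with-resources and StandardCharsets parts need Java 7 or later; on
JDK 1.4.2 the same idea would be new OutputStreamWriter(out, "UTF-8") with
an explicit close().

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8XmlSketch {
    /** Minimal escaping for element text (hypothetical helper). */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Serializes one title element; the Writer does the UTF-8 encoding. */
    static byte[] titleAsUtf8Xml(String title) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write("<titl>" + escape(title) + "</titl>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "\u00CD" is I-acute, written as an escape so the source file's
        // own encoding does not matter.
        byte[] xml = titleAsUtf8Xml("PESQUER\u00CDA");
        System.out.println(xml.length + " bytes written");
    }
}
```

In a real program the ByteArrayOutputStream would be replaced by a
FileOutputStream or whatever stream the XML is going to; the point of the
sketch is only that the String is left alone and the charset is named
explicitly where the bytes are produced.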
