Posted to j-users@xerces.apache.org by Inma Marín López <in...@dif.um.es> on 2007/08/02 09:53:17 UTC

Problems with ISO-8859-1 and UTF-8 encodings

Hi all,

 

I have some problems with ISO-8859-1 and UTF-8 encodings in XML documents. Specifically, I have this ISO-8859-1-encoded XML document:

 

<?xml version="1.0" encoding="ISO-8859-1"?>

<DOCUMENTO>

<PERFILES>Á</PERFILES>

<PERFILES>É</PERFILES>

<PERFILES>Í</PERFILES>

<PERFILES>Ó</PERFILES>

<PERFILES>Ú</PERFILES>

</DOCUMENTO> 

 

Then I encode it as UTF-8 with the following piece of code:

 

            Transformer transformer = TransformerFactory.newInstance().newTransformer();

            StreamSource ds = new StreamSource(new ByteArrayInputStream(xmliso88191.getBytes()));

            transformer.setOutputProperty(OutputKeys.ENCODING,"utf-8");

            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            transformer.transform(ds,new StreamResult(baos));

            return baos.toString();

 

to obtain this XML document:

 

<?xml version="1.0" encoding="utf-8"?>

<DOCUMENTO>

<PERFILES>Ã?</PERFILES>

<PERFILES>É</PERFILES>

<PERFILES>Ã?</PERFILES>

<PERFILES>Ó</PERFILES>

<PERFILES>Ú</PERFILES>

</DOCUMENTO>

 

Next, I try to encode this UTF-8 document back to ISO-8859-1:

 

            Transformer transformer = TransformerFactory.newInstance().newTransformer();

            StreamSource ds = new StreamSource(new ByteArrayInputStream(xmlutf8.getBytes()));

            transformer.setOutputProperty(OutputKeys.ENCODING,"iso-8859-1");

            ByteArrayOutputStream baos = new ByteArrayOutputStream();

            transformer.transform(ds,new StreamResult(baos));

            return baos.toString();

 

But this does not work. Instead, I get the following exception:

 

[Fatal Error] :8:11: Invalid byte 2 of 2-byte UTF-8 sequence.
javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
        at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:449)
        at codificacion.PruebasCodificacion.encodeISO88891(PruebasCodificacion.java:302)
        at codificacion.PruebasCodificacion.prueba(PruebasCodificacion.java:73)
        at codificacion.PruebasCodificacion.main(PruebasCodificacion.java:356)
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:432)

 

 

Is this process correct? If it is, the exception seems to be caused by the ‘Ã?’ characters (the UTF-8 encoding of ‘Á’ and ‘Í’), so I would like to know how I can encode ‘Á’ and ‘Í’ as UTF-8 and then convert them back to ISO-8859-1.

 

Could anybody be so kind as to help me, please?

 

Thank you very much in advance.

Inma.

 


RE: Problems with ISO-8859-1 and UTF-8 encodings

Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Inma,

xmlutf8.getBytes() doesn't return what you think. Both 
ByteArrayOutputStream.toString() [1] and String.getBytes() [2] use the 
default encoding (which is probably ISO-8859-1 on your system) for 
converting between bytes -> chars and chars -> bytes. You can fix this by 
specifying the encoding on these methods, but if I were you I'd avoid 
doing the conversions altogether and just create the 
StreamSource/StreamResult with a java.io.StringReader/java.io.StringWriter 
instead.
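
As a rough illustration of the StringReader/StringWriter suggestion above, here is a minimal sketch. The class name ReencodeWithStrings and the helper reencode are hypothetical, not taken from the original code, and the sample document is shortened. The idea is that the transformer only ever sees characters, so no byte/char conversion with the platform default encoding takes place.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ReencodeWithStrings {

    // Hypothetical helper: runs the identity transform on character data only,
    // so String.getBytes() and ByteArrayOutputStream.toString() are never involved.
    static String reencode(String xml, String targetEncoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Controls the encoding named in the XML declaration of the result.
        transformer.setOutputProperty(OutputKeys.ENCODING, targetEncoding);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened sample document; U+00C1 and U+00CD are the two characters
        // that were damaged in the original post.
        String xmlIso = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<DOCUMENTO><PERFILES>\u00C1</PERFILES><PERFILES>\u00CD</PERFILES></DOCUMENTO>";

        String xmlUtf8 = reencode(xmlIso, "utf-8");
        String xmlIsoAgain = reencode(xmlUtf8, "iso-8859-1");

        System.out.println(xmlUtf8);
        System.out.println(xmlIsoAgain);
    }
}
```

Note that the returned value is still a Java String. When it is finally written to a file or socket, it has to be encoded with the same charset that its XML declaration names, for example through an OutputStreamWriter created with that charset.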

Thanks.

[1] 
http://java.sun.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html#toString()
[2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

Robert Houben <Ro...@fusionware.net> wrote on 08/02/2007 11:36:34 
AM:

> Hi Inma,
> 
> The last line of your first block you have:
> return baos.toString();
> Note that when you do "toString()" on the byte array it will return 
> a string in Java internal form, not UTF8.  I'm guessing that in your
> next block of code, xmlutf8 is the result of the first block.  This 
> means that when you getBytes() from it, you are getting bytes that 
> are no longer in UTF8 form.
> 
> HTH,



RE: Problems with ISO-8859-1 and UTF-8 encodings

Posted by Robert Houben <Ro...@fusionware.net>.
Hi Inma,

In the last line of your first block you have:
return baos.toString();
Note that when you do “toString()” on the byte array it will return a string in Java internal form, not UTF8.  I’m guessing that in your next block of code, xmlutf8 is the result of the first block.  This means that when you getBytes() from it, you are getting bytes that are no longer in UTF8 form.
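
To make the point above concrete, here is a small sketch of what happens to the UTF-8 bytes of ‘Á’ when they are decoded and re-encoded with a charset other than UTF-8. The class name CharsetRoundTrip and the helper roundTrip are hypothetical. The choice of windows-1252 as the platform default is an assumption: ISO-8859-1 itself maps every byte to a character and back, so the ‘?’ in the posted output suggests a default in which the second byte of ‘Á’ and ‘Í’ has no mapping. StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {

    // Decodes the bytes to a String and encodes them again with the same charset,
    // which is what toString() followed by getBytes() does with the default encoding.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+00C1 is the two bytes C3 81.
        byte[] utf8 = "\u00C1".getBytes(StandardCharsets.UTF_8);

        // Assumption: the platform default behaves like windows-1252.
        Charset assumedDefault = Charset.forName("windows-1252");

        byte[] viaDefault = roundTrip(utf8, assumedDefault);
        byte[] viaUtf8 = roundTrip(utf8, StandardCharsets.UTF_8);

        // Under the assumption above the first comparison is expected to fail,
        // while the explicit UTF-8 round trip returns the original bytes.
        System.out.println("default round trip intact: " + Arrays.equals(utf8, viaDefault));
        System.out.println("UTF-8 round trip intact:   " + Arrays.equals(utf8, viaUtf8));
    }
}
```

If the byte-based approach is kept, passing the charset explicitly on both sides (baos.toString("UTF-8") in the first block and xmlutf8.getBytes("UTF-8") in the second) should avoid the mismatch.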

HTH,
