Posted to j-users@xerces.apache.org by Inma Marín López <in...@dif.um.es> on 2007/08/02 09:53:17 UTC
Problems with ISO-8859-1 and UTF-8 encodings
Hi all,
I have some problems with ISO-8859-1 and UTF-8 encodings in XML documents. Specifically, I have this ISO-8859-1 encoded XML document:
<?xml version="1.0" encoding="ISO-8859-1"?>
<DOCUMENTO>
<PERFILES>Á</PERFILES>
<PERFILES>É</PERFILES>
<PERFILES>Í</PERFILES>
<PERFILES>Ó</PERFILES>
<PERFILES>Ú</PERFILES>
</DOCUMENTO>
Then I convert it to UTF-8 with the following piece of code:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamSource ds = new StreamSource(new ByteArrayInputStream(xmliso88191.getBytes()));
transformer.setOutputProperty(OutputKeys.ENCODING,"utf-8");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
transformer.transform(ds,new StreamResult(baos));
return baos.toString();
to obtain this XML document:
<?xml version="1.0" encoding="utf-8"?>
<DOCUMENTO>
<PERFILES>Ã?</PERFILES>
<PERFILES>É</PERFILES>
<PERFILES>Ã?</PERFILES>
<PERFILES>Ó</PERFILES>
<PERFILES>Ú</PERFILES>
</DOCUMENTO>
Next, I try to convert this UTF-8 encoded document back to ISO-8859-1:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
StreamSource ds = new StreamSource(new ByteArrayInputStream(xmlutf8.getBytes()));
transformer.setOutputProperty(OutputKeys.ENCODING,"iso-8859-1");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
transformer.transform(ds,new StreamResult(baos));
return baos.toString();
But this does not work. Instead, I get the following exception:
[Fatal Error] :8:11: Invalid byte 2 of 2-byte UTF-8 sequence.
javax.xml.transform.TransformerException: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:449)
at codificacion.PruebasCodificacion.encodeISO88891(PruebasCodificacion.java:302)
at codificacion.PruebasCodificacion.prueba(PruebasCodificacion.java:73)
at codificacion.PruebasCodificacion.main(PruebasCodificacion.java:356)
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:432)
Is this process correct? If it is, the exception seems to be caused by the ‘Ã?’ characters (the UTF-8 encodings of ‘Á’ and ‘Í’), so I would like to know how I can encode the ‘Á’ and ‘Í’ characters in UTF-8 and then convert them back to ISO-8859-1.
Could anybody be so kind as to help me, please?
Thank you very much in advance.
Inma.
RE: Problems with ISO-8859-1 and UTF-8 encodings
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Hi Inma,
xmlutf8.getBytes() doesn't return what you think. Both
ByteArrayOutputStream.toString() [1] and String.getBytes() [2] use the
default encoding (which is probably ISO-8859-1 on your system) for
converting between bytes -> chars and chars -> bytes. You can fix this by
specifying the encoding on these methods, but if I were you I'd avoid
doing the conversions altogether and just create the
StreamSource/StreamResult with a java.io.StringReader/java.io.StringWriter
instead.
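The StringReader/StringWriter suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name Reencode and the helper redeclare are hypothetical, and the sample document is a shortened version of the one in the question. Because the source and result are character streams, no byte/char conversion through the platform default encoding takes place; the ENCODING output property then mainly affects the XML declaration and which characters the serializer may escape.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Reencode {

    // Hypothetical helper: runs an identity transform over character streams,
    // so the document never passes through String.getBytes() or
    // ByteArrayOutputStream.toString() with the platform default encoding.
    public static String redeclare(String xml, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shortened sample document (an assumption, based on the question).
        String iso = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<DOCUMENTO><PERFILES>\u00C1</PERFILES>"
                + "<PERFILES>\u00CD</PERFILES></DOCUMENTO>";
        String utf8 = redeclare(iso, "UTF-8");
        String back = redeclare(utf8, "ISO-8859-1");
        System.out.println(utf8);
        System.out.println(back);
    }
}
```

The returned String still has to be encoded when it is finally written to a file or socket, and the charset used at that point should match the one named in the XML declaration.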
Thanks.
[1] http://java.sun.com/javase/6/docs/api/java/io/ByteArrayOutputStream.html#toString()
[2] http://java.sun.com/javase/6/docs/api/java/lang/String.html#getBytes()
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
Robert Houben <Ro...@fusionware.net> wrote on 08/02/2007 11:36:34
AM:
> Hi Inma,
>
> In the last line of your first block you have:
> return baos.toString();
> Note that when you do "toString()" on the byte array it will return
> a string in Java internal form, not UTF-8. I'm guessing that in your
> next block of code, xmlutf8 is the result of the first block. This
> means that when you getBytes() from it, you are getting bytes that
> are no longer in UTF-8 form.
>
> HTH,
RE: Problems with ISO-8859-1 and UTF-8 encodings
Posted by Robert Houben <Ro...@fusionware.net>.
Hi Inma,
In the last line of your first block you have:
return baos.toString();
Note that when you call “toString()” on the byte array output stream, it returns a string in Java internal form, not UTF-8. I’m guessing that in your next block of code, xmlutf8 is the result of the first block. This means that when you call getBytes() on it, you are getting bytes that are no longer in UTF-8 form.
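The effect described above can be illustrated with a small sketch. The class and method names (CharsetRoundTrip, decodeThenEncode, hex) are hypothetical, and US-ASCII is used only as a stand-in for a default charset that cannot represent every byte; it is not claimed to be the charset on the original poster's system. The sketch decodes the UTF-8 bytes of ‘Á’ and re-encodes them, once with UTF-8 on both sides and once with the stand-in charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Hypothetical helper: decode bytes to a String and encode them again,
    // using the same explicitly named charset in both directions.
    public static byte[] decodeThenEncode(byte[] bytes, Charset charset) {
        String text = new String(bytes, charset);
        return text.getBytes(charset);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00C1".getBytes(StandardCharsets.UTF_8);

        // Naming UTF-8 explicitly on both conversions keeps the bytes intact.
        System.out.println(hex(decodeThenEncode(utf8, StandardCharsets.UTF_8)));    // C3 81

        // A charset that cannot represent the bytes replaces them with '?'.
        System.out.println(hex(decodeThenEncode(utf8, StandardCharsets.US_ASCII))); // 3F 3F
    }
}
```

As a side observation, ‘Á’ and ‘Í’ encode in UTF-8 as C3 81 and C3 8D, and 0x81 and 0x8D are unassigned in windows-1252, while the second bytes of ‘É’, ‘Ó’ and ‘Ú’ are assigned. That would be consistent with only those two characters turning into ‘Ã?’ if the platform default were windows-1252, though this is a guess about the original system.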
HTH,