Posted to j-users@xerces.apache.org by Inma Marín López <in...@dif.um.es> on 2006/07/28 13:18:06 UTC
Error when parsing ISO-8859-1 encoded documents
Hello all!
I have an xml document which includes special characters, for example,
<Document>
<one>melón</one>
<two>1º</two>
</Document>
And I want to get it in canonical form, so I do the following (using Apache
XML Security and Xerces 2.7.1):
org.apache.xml.security.c14n.Canonicalizer c14n =
    org.apache.xml.security.c14n.Canonicalizer.getInstance(
        org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);
byte[] canonicalized =
    c14n.canonicalize(xmldocument.getBytes());
However, I obtain the following exception:
org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at org.apache.xml.security.c14n.Canonicalizer.canonicalize(Unknown Source)
The xml document is ISO-8859-1 encoded, because I want to keep special
characters (if I encode it in UTF-8, the document turns into the following:
<Document>
<one>mel?n</one>
<two>1?</two>
</Document>
).
Could you be so kind as to tell me how to parse an ISO-8859-1 encoded
document with xerces, please????
Thank you very much in advance.
Inma.
Re: Error when parsing ISO-8859-1 encoded documents
Posted by Michael Glavassevich <mr...@ca.ibm.com>.
Stanimir Stamenkov <st...@myrealbox.com> wrote on 07/28/2006 10:46:24 AM:
<snip/>
> > Could you be so kind as to tell me how to parse an ISO-8859-1 encoded
> > document with xerces, please????
>
> It seems you're trying one thing but asking about another. The
> things I've mentioned above still apply. If you don't want to or can't
> add an XML Declaration to the document, you could feed the parser an
> already decoded character stream instead of a byte stream, like:
>
> InputStream byteStream;
> ...
> Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
> InputSource source;
> DocumentBuilder parser; // it could be SAXParser as well
> ...
> source.setCharacterStream(charStream);
> parser.parse(source);
Or set the encoding on the InputSource if you're sure what it is and give
the parser an opportunity to use an optimized reader.
InputSource source;
...
source.setByteStream(byteStream);
source.setEncoding("ISO-8859-1");
parser.parse(source);
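
For anyone who wants to try the encoding hint end to end, here is a minimal sketch. It assumes the JAXP parser bundled with the JDK rather than a standalone Xerces 2.7.1 jar, and the class and method names (EncodingHintDemo, parseWithEncoding) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Xerces or Apache XML Security.
class EncodingHintDemo {

    // Parses raw bytes and tells the parser up front which encoding they use,
    // so a missing XML Declaration does not make it fall back to UTF-8.
    static Document parseWithEncoding(byte[] xmlBytes, String encoding) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource();
        source.setByteStream(new ByteArrayInputStream(xmlBytes));
        source.setEncoding(encoding);
        return parser.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep the example independent of the source file encoding.
        String xml = "<Document><one>mel\u00f3n</one><two>1\u00ba</two></Document>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);

        Document doc = parseWithEncoding(latin1, "ISO-8859-1");
        System.out.println(doc.getElementsByTagName("one").item(0).getTextContent());
    }
}
```

The hint is only as good as your knowledge of the bytes: if the stream is not really ISO-8859-1, the parser will decode it wrongly without complaining.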
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-users-help@xerces.apache.org
Re: Error when parsing ISO-8859-1 encoded documents
Posted by Stanimir Stamenkov <st...@myrealbox.com>.
/Inma Marín López/:
> I have an xml document which includes special characters, for example,
>
> <Document>
> <one>melón</one>
> <two>1º</two>
> </Document>
>
> And I want to get it in canonical form, so I do the following (using
> Apache XML Security and Xerces 2.7.1):
>
> org.apache.xml.security.c14n.Canonicalizer c14n =
> org.apache.xml.security.c14n.Canonicalizer.getInstance(
> org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);
> byte [] canonicalized =
> c14n.canonicalize(xmldocument.getBytes());
>
> However, I obtain the following exception:
>
> org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.
I guess your document should include an XML Declaration [1]:
<?xml version="1.0" encoding="ISO-8859-1"?>
Because of the rules [2] used to detect the character encoding of a
document, one without an XML Declaration (and without a byte order
mark) is read as UTF-8.
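
To see the difference the declaration makes, here is a small sketch that parses the same ISO-8859-1 bytes with and without it. It assumes the JDK's bundled JAXP parser; XmlDeclarationDemo and textOfOne are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
class XmlDeclarationDemo {

    // Unicode escapes keep the example independent of the source file encoding.
    static final String BODY = "<Document><one>mel\u00f3n</one><two>1\u00ba</two></Document>";

    // Parses the bytes as-is (no external encoding hint) and returns the text of <one>.
    static String textOfOne(byte[] xmlBytes) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = parser.parse(new ByteArrayInputStream(xmlBytes));
        return doc.getElementsByTagName("one").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String declaration = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";

        // With the declaration the parser knows how to decode the bytes.
        byte[] declared = (declaration + BODY).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("with declaration: " + textOfOne(declared));

        // Without it the parser assumes UTF-8 and rejects the ISO-8859-1 bytes.
        byte[] undeclared = BODY.getBytes(StandardCharsets.ISO_8859_1);
        try {
            System.out.println("without declaration: " + textOfOne(undeclared));
        } catch (SAXException | IOException e) {
            System.out.println("without declaration: " + e.getMessage());
        }
    }
}
```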
Alternatively, you should supply a UTF-8 sequence to the
|Canonicalizer.canonicalize(byte[])| method. If |xmldocument| is a
|String|:
Canonicalizer c14n;
...
byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes("UTF-8"));
The |String.getBytes()| (no-args) method returns bytes encoding the
text using the platform's default encoding, which is not necessarily
"ISO-8859-1".
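
The effect of the charset argument is easy to check with plain JDK classes. A minimal sketch follows; GetBytesDemo and encodedLength are hypothetical names for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class GetBytesDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Unicode escape for the o with acute accent, independent of source encoding.
        String text = "mel\u00f3n";

        // The accented letter is one byte in ISO-8859-1 but two bytes in UTF-8.
        System.out.println("ISO-8859-1: " + encodedLength(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + encodedLength(text, StandardCharsets.UTF_8));

        // The no-args getBytes() uses whatever this reports, which varies by platform.
        System.out.println("default:    " + Charset.defaultCharset());
    }
}
```

Passing the charset explicitly makes the byte sequence the same on every machine, which is what a parser expecting UTF-8 needs.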
> The xml document is ISO-8859-1 encoded, because I want to keep special
> characters (if I encode it in UTF-8, the document turns into the following:
>
> <Document>
> <one>mel?n</one>
> <two>1?</two>
> </Document>
How do you encode the document in UTF-8? You're obviously doing
something wrong, as Unicode contains the full ISO-8859-1 repertoire.
Are you just decoding the "ISO-8859-1" encoded document as "UTF-8",
so that invalid UTF-8 byte sequences get substituted with '?'
(question mark)?
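
As a guess at what is going wrong, the sketch below decodes ISO-8859-1 bytes with the wrong charset and then converts them properly. It uses only JDK classes; MisdecodeDemo and decode are hypothetical names, and your actual conversion code may of course differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class MisdecodeDemo {

    // Decodes bytes with the given charset; malformed input becomes U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "mel\u00f3n";
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Wrong: the ISO-8859-1 bytes are not valid UTF-8, so the accented
        // letter is lost and replaced during decoding.
        String garbled = decode(latin1, StandardCharsets.UTF_8);
        System.out.println("lost the accent: " + (garbled.indexOf('\uFFFD') >= 0));

        // Right: decode with the real source encoding, then encode as UTF-8.
        String text = decode(latin1, StandardCharsets.ISO_8859_1);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + decode(utf8, StandardCharsets.UTF_8).equals(original));
    }
}
```

Once the replacement character is written out in an encoding that cannot represent it, it typically shows up as a literal question mark.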
> Could you be so kind as to tell me how to parse an ISO-8859-1 encoded
> document with xerces, please????
It seems you're trying one thing but asking about another. The
things I've mentioned above still apply. If you don't want to or can't
add an XML Declaration to the document, you could feed the parser an
already decoded character stream instead of a byte stream, like:
InputStream byteStream;
...
Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
InputSource source;
DocumentBuilder parser; // it could be SAXParser as well
...
source.setCharacterStream(charStream);
parser.parse(source);
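
Filled in as a complete program, the character-stream approach might look like the sketch below. It assumes the JDK's bundled JAXP parser; CharStreamDemo and parseDecoded are hypothetical names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class CharStreamDemo {

    // Decodes the bytes ourselves and hands the parser characters, so the
    // parser never has to guess the encoding.
    static Document parseDecoded(InputStream byteStream, String encoding) throws Exception {
        Reader charStream = new InputStreamReader(byteStream, encoding);
        InputSource source = new InputSource();
        source.setCharacterStream(charStream);
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return parser.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep the example independent of the source file encoding.
        String xml = "<Document><one>mel\u00f3n</one><two>1\u00ba</two></Document>";
        InputStream byteStream =
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1));

        Document doc = parseDecoded(byteStream, "ISO-8859-1");
        System.out.println(doc.getElementsByTagName("two").item(0).getTextContent());
    }
}
```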
[1] http://www.w3.org/TR/REC-xml/#NT-XMLDecl
[2] http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
--
Stanimir