Posted to j-users@xerces.apache.org by Inma Marín López <in...@dif.um.es> on 2006/07/28 13:18:06 UTC

Error when parsing ISO-8859-1 encoded documents

 

Hello all!

 

 I have an xml document which includes special characters, for example,

 

<Document>

            <one>melón</one>

            <two>1º</two>

</Document>

 

And I want to get it in canonical form, so I do the following (using Apache
XML Security and Xerces 2.7.1):

 

            org.apache.xml.security.c14n.Canonicalizer c14n =
                org.apache.xml.security.c14n.Canonicalizer.getInstance(
                    org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);

            byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes());

 

However, I obtain the following exception:

 

org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.

            at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)

            at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)

            at org.apache.xml.security.c14n.Canonicalizer.canonicalize(Unknown Source)

 

 

The xml document is ISO-8859-1 encoded, because I want to keep special
characters (if I encode it in UTF-8, the document turns into the following:

 

<Document>

            <one>mel?n</one>

            <two>1?</two>

</Document>

 ).

 

Could you be so kind as to tell me how to parse an ISO-8859-1 encoded
document with xerces, please????

Thank you very much in advance.

 

Inma.

Re: Error when parsing ISO-8859-1 encoded documents

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

Stanimir Stamenkov <st...@myrealbox.com> wrote on 07/28/2006 10:46:24 AM:

<snip/>

> > Could you be so kind as to tell me how to parse an ISO-8859-1 encoded 
> > document with xerces, please????
> 
> It seems you're trying one thing but asking about another. The 
> things I've mentioned above still apply. If you don't want to or can't 
> add an XML Declaration to the document, you could feed the parser an 
> already decoded character stream instead of a byte stream, like:
> 
> InputStream byteStream;
> ...
> Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
> InputSource source;
> DocumentBuilder parser;   // it could be SAXParser as well
> ...
> source.setCharacterStream(charStream);
> parser.parse(source);

Or set the encoding on the InputSource if you're sure what it is and give 
the parser an opportunity to use an optimized reader.

InputSource source;
...
source.setByteStream(byteStream);
source.setEncoding("ISO-8859-1");
parser.parse(source);
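
For anyone who wants to try this end to end, here is a minimal sketch of the same idea using only the JAXP classes bundled with the JDK. The class name EncodingHintExample, the helper parseWithEncoding and the sample document are made up for illustration; only the setByteStream/setEncoding calls come from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of Xerces or XML Security.
final class EncodingHintExample {

    // Parses a byte stream and tells the parser which encoding the bytes use,
    // so the document does not need to carry an XML Declaration.
    static Document parseWithEncoding(InputStream byteStream, String encoding) throws Exception {
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource source = new InputSource();
        source.setByteStream(byteStream);
        source.setEncoding(encoding);
        return parser.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Unicode escapes keep the source file independent of the compiler's encoding:
        // \u00F3 is the o with acute accent, \u00BA is the masculine ordinal indicator.
        byte[] bytes = "<Document><one>mel\u00F3n</one><two>1\u00BA</two></Document>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseWithEncoding(new ByteArrayInputStream(bytes), "ISO-8859-1");
        System.out.println(doc.getElementsByTagName("one").item(0).getTextContent());
    }
}
```

The encoding given here is only as good as your knowledge of the bytes. If the stream is not really ISO-8859-1, the parser will decode it incorrectly without complaining.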

> [1] http://www.w3.org/TR/REC-xml/#NT-XMLDecl
> [2] http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
> 
> -- 
> Stanimir
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: j-users-unsubscribe@xerces.apache.org
> For additional commands, e-mail: j-users-help@xerces.apache.org

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org


Re: Error when parsing ISO-8859-1 encoded documents

Posted by Stanimir Stamenkov <st...@myrealbox.com>.

/Inma Marín López/:

>  I have an xml document which includes special characters, for example,
> 
> <Document>
>             <one>melón</one>
>             <two>1º</two>
> </Document>
> 
> And I want to get it in canonical form, so I do the following (using 
> Apache XML Security and Xerces 2.7.1):
> 
>             org.apache.xml.security.c14n.Canonicalizer c14n = 
> org.apache.xml.security.c14n.Canonicalizer.getInstance(
> org.apache.xml.security.transforms.Transforms.TRANSFORM_C14N_EXCL_WITH_COMMENTS);
>             byte [] canonicalized = 
> c14n.canonicalize(xmldocument.getBytes());
> 
> However, I obtain the following exception:
> 
> org.xml.sax.SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence.

I guess your document should include an XML Declaration [1]:

<?xml version="1.0" encoding="ISO-8859-1"?>

Under the rules [2] for detecting a document's character encoding, a 
document with no XML Declaration (and no byte order mark) is read as UTF-8.
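
To see the effect of the declaration in isolation, here is a small sketch that parses the same ISO-8859-1 bytes twice, once with and once without an XML Declaration. It uses the JAXP parser bundled with the JDK rather than a standalone Xerces build, and the class and method names (XmlDeclarationExample, parseOne) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical example class, not part of Xerces or XML Security.
final class XmlDeclarationExample {

    // Encodes the text as ISO-8859-1 bytes, parses the raw bytes and
    // returns the text content of the first <one> element.
    static String parseOne(String xml) throws Exception {
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = parser.parse(new ByteArrayInputStream(latin1));
        return doc.getElementsByTagName("one").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String body = "<Document><one>mel\u00F3n</one></Document>";

        // With the declaration the parser knows how to decode the bytes.
        String declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body;
        System.out.println("With declaration: " + parseOne(declared));

        // Without it the bytes are read as UTF-8, and the single byte 0xF3
        // followed by 'n' is not a valid UTF-8 sequence.
        try {
            parseOne(body);
            System.out.println("Without declaration: parsed (unexpected)");
        } catch (SAXException | IOException e) {
            System.out.println("Without declaration: " + e.getMessage());
        }
    }
}
```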

Alternatively, you should supply a UTF-8 byte sequence to the 
|Canonicalizer.canonicalize(byte[])| method. If |xmldocument| is a 
|String|:

Canonicalizer c14n;
...
byte[] canonicalized = c14n.canonicalize(xmldocument.getBytes("UTF-8"));

The |String.getBytes()| (no-args) method returns bytes encoding the 
text using the platform's default encoding, not necessarily "ISO-8859-1".
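
A quick way to convince yourself that the charset matters is to compare the byte counts for the same text. This is only an illustrative sketch; GetBytesExample and encodedLength are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class for illustration only.
final class GetBytesExample {

    // Returns how many bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "mel\u00F3n"; // five characters, one of them non-ASCII

        // The no-args getBytes() uses this charset, which varies between machines.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // ISO-8859-1 stores every character of this text in one byte.
        System.out.println("ISO-8859-1 bytes: " + encodedLength(text, StandardCharsets.ISO_8859_1));

        // UTF-8 needs two bytes for the accented character.
        System.out.println("UTF-8 bytes: " + encodedLength(text, StandardCharsets.UTF_8));
    }
}
```

Naming the charset explicitly makes the result the same on every platform, which is what you want when the bytes are handed to a parser.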

> The xml document is ISO-8859-1 encoded, because I want to keep special 
> characters (if I encode it in UTF-8, the document turns into the following:
> 
> <Document>
>             <one>mel?n</one>
>             <two>1?</two>
> </Document>

How do you encode the document in UTF-8? You're obviously doing 
something wrong as Unicode contains the full ISO-8859-1 repertoire 
for sure. Are you just decoding the "ISO-8859-1" encoded document 
using "UTF-8" where invalid UTF-8 byte sequences get substituted 
with '?' (question mark)?
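
Here is a small sketch of how that kind of damage can arise in plain Java, independent of any XML parser. It is a guess at what might be happening on your side, not a diagnosis, and the names WrongDecoderExample and decode are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class for illustration only.
final class WrongDecoderExample {

    // Decodes raw bytes with the given charset; malformed input is
    // replaced rather than reported, which is how data gets lost silently.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = "mel\u00F3n".getBytes(StandardCharsets.ISO_8859_1);

        // Right decoder: the text survives the round trip.
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1));

        // Wrong decoder: the lone byte 0xF3 is not valid UTF-8 and becomes
        // the replacement character U+FFFD.
        String damaged = decode(latin1, StandardCharsets.UTF_8);
        System.out.println(damaged.indexOf('\uFFFD') >= 0);

        // Encoding to a charset that cannot represent a character
        // substitutes a literal question mark.
        byte[] ascii = "mel\u00F3n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decode(ascii, StandardCharsets.US_ASCII));
    }
}
```

Converting ISO-8859-1 text to UTF-8 correctly means decoding with ISO-8859-1 first and only then encoding the resulting String as UTF-8.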

> Could you be so kind as to tell me how to parse an ISO-8859-1 encoded 
> document with xerces, please????

It seems you're trying one thing but asking about another. The 
things I've mentioned above still apply. If you don't want to or can't 
add an XML Declaration to the document, you could feed the parser an 
already decoded character stream instead of a byte stream, like:

InputStream byteStream;
...
Reader charStream = new InputStreamReader(byteStream, "ISO-8859-1");
InputSource source;
DocumentBuilder parser;   // it could be SAXParser as well
...
source.setCharacterStream(charStream);
parser.parse(source);
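
Filled in, the fragment above could look roughly like the following. This is a sketch against the JAXP classes bundled with the JDK; CharacterStreamExample, parseDecoded and the sample document are made-up names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of Xerces or XML Security.
final class CharacterStreamExample {

    // Decodes the bytes ourselves and hands the parser characters,
    // so the parser never has to guess the encoding.
    static Document parseDecoded(InputStream byteStream, String encoding) throws Exception {
        Reader charStream = new InputStreamReader(byteStream, encoding);
        InputSource source = new InputSource();
        source.setCharacterStream(charStream);
        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return parser.parse(source);
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<Document><one>mel\u00F3n</one><two>1\u00BA</two></Document>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parseDecoded(new ByteArrayInputStream(bytes), "ISO-8859-1");
        System.out.println(doc.getElementsByTagName("two").item(0).getTextContent());
    }
}
```

Note that when an InputSource carries a character stream, the parser reads it directly and disregards any encoding declaration inside the document, so the charset passed to InputStreamReader has to be the right one.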


[1] http://www.w3.org/TR/REC-xml/#NT-XMLDecl
[2] http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info

-- 
Stanimir
