Posted to j-dev@xerces.apache.org by jd...@msn.com on 2011/12/06 05:10:44 UTC

Valid XML comment throwing SAXParseException

Hello,

I deal with Japanese text quite a bit and was recently parsing a file that contained the Unicode character U+2000B (http://www.fileformat.info/info/unicode/char/2000B/index.htm) in a comment. This character appears to have caused a SAXParseException to be thrown:

[Fatal Error] :484236:25: An invalid XML character (Unicode: 0xd840) was found in the comment.
org.xml.sax.SAXParseException; lineNumber: 484236; columnNumber: 25; An invalid XML character (Unicode: 0xd840) was found in the comment.
 at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:254)
 at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)

In this particular case, I was attempting to parse Jim Breen's publicly available Kanji dictionary file. This file is used quite extensively in many Japanese/English open-source dictionaries. I exchanged a few emails with Jim and he is confident that the XML is valid. I've reviewed the "Characters" section of the W3C XML 1.0 spec (http://www.w3.org/TR/2004/REC-xml-20040204/#charsets) and honestly can not tell for certain if U+2000B is valid in a comment.

Basically Jim's file has an entry for each kanji and a comment prior to each entry that looks like this:

<!-- Entry for Kanji: X -->

where X is the actual character. If I remove all such comments, the file parses fine.

If you are interested in checking out the file, it can be downloaded in GZIP format from Jim Breen's site.

Info Page: http://www.csse.monash.edu.au/~jwb/kanjidic2/
XML File: http://www.csse.monash.edu.au/~jwb/kanjidic2/kanjidic2.xml.gz

As a side note, I was able to successfully parse this file with Apache Xerces Perl.

Thank you for your time.

Best Regards,
Rick Noelle
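
On the question of whether U+2000B is allowed: the value 0xd840 in the error message is not the character itself but the first UTF-16 code unit (the high surrogate) of U+2000B. The sketch below is only an illustration, not anything taken from the parser's source. It shows how Java splits the code point into a surrogate pair, and checks the code point against the Char production of the XML 1.0 spec (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The class and method names are made up for this example.

```java
final class SupplementaryCharCheck {

    /** Range check transcribed from the XML 1.0 "Char" production. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** UTF-16 code units Java uses to represent the given code point. */
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x2000B;
        char[] units = utf16Units(codePoint);

        // A supplementary character occupies two Java chars (a surrogate pair).
        System.out.printf("U+%X -> 0x%04X 0x%04X%n",
                codePoint, (int) units[0], (int) units[1]);

        // The full code point falls in [#x10000-#x10FFFF], so it is a legal Char.
        System.out.println("U+2000B is an XML Char: " + isXmlChar(codePoint));

        // A high surrogate examined on its own is outside every allowed range.
        System.out.println("0xD840 alone is an XML Char: " + isXmlChar(units[0]));
    }
}
```

Running this prints the pair 0xD840 0xDC0B, then true for the code point and false for the lone surrogate. That is consistent with a parser that looked at one half of the pair in isolation, although the thread does not establish why that happened.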

RE: Valid XML comment throwing SAXParseException

Posted by jd...@msn.com.

Thanks for the prompt reply. I apologize for mistaking the two! I appreciate the clarification. Rick Noelle
Subject: Re: Valid XML comment throwing SAXParseException
To: j-dev@xerces.apache.org
CC: j-users@xerces.apache.org
From: mrglavas@ca.ibm.com
Date: Mon, 5 Dec 2011 23:54:41 -0500

com.sun.org.apache.xerces.internal.* is not Apache Xerces.

The implementation which ships in the Oracle JDK is a fork of the Apache code base which Oracle/Sun has made all sorts of changes and additions to. If you are experiencing issues with that code base you would need to pursue it with Oracle. We have no influence over what they include in their versions.

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org

<jd...@msn.com> wrote on 12/05/2011 11:10:44 PM:

> Hello,
>
> I deal with Japanese text quite a bit and was recently parsing a
> file that contained the Unicode character U+2000B
> (http://www.fileformat.info/info/unicode/char/2000B/index.htm) in a comment.
> This character appears to have caused a SAXParseException to be thrown:
>
> [Fatal Error] :484236:25: An invalid XML character (Unicode: 0xd840)
> was found in the comment.
> org.xml.sax.SAXParseException; lineNumber: 484236; columnNumber: 25;
> An invalid XML character (Unicode: 0xd840) was found in the comment.
>  at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse
> (DOMParser.java:254)
>  at
> com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse
> (DocumentBuilderImpl.java:300)
>
> In this particular case, I was attempting to parse Jim Breen's
> publicly available Kanji dictionary file. This file is used quite
> extensively in many Japanese/English open-source dictionaries. I
> exchanged a few emails with Jim and he is confident that the XML is
> valid. I've reviewed the "Characters" section of the W3C XML 1.0 spec (
> http://www.w3.org/TR/2004/REC-xml-20040204/#charsets) and honestly
> can not tell for certain if U+2000B is valid in a comment.
>
> Basically Jim's file has an entry for each kanji and a comment prior
> to each entry that looks like this:
>
> <!-- Entry for Kanji: X -->
>
> where X is the actual character. If I remove all such comments, the
> file parses fine.
>
> If you are interested in checking out the file, it can be downloaded
> in GZIP format from Jim Breen's site.
>
> Info Page: http://www.csse.monash.edu.au/~jwb/kanjidic2/
> XML File: http://www.csse.monash.edu.au/~jwb/kanjidic2/kanjidic2.xml.gz
>
> As a side note, I was able to successfully parse this file with
> Apache Xerces Perl.
>
> Thank you for your time.
>
> Best Regards,
> Rick Noelle

Re: Valid XML comment throwing SAXParseException

Posted by Michael Glavassevich <mr...@ca.ibm.com>.

com.sun.org.apache.xerces.internal.* is not Apache Xerces.

The implementation which ships in the Oracle JDK is a fork of the Apache
code base which Oracle/Sun has made all sorts of changes and additions to.
If you are experiencing issues with that code base you would need to pursue
it with Oracle. We have no influence over what they include in their
versions.

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org
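
To tell which of the two code bases an application is actually running, one option is to print the class name of the factory that JAXP resolves. The sketch below is illustrative only; the class and method names are made up. It relies on the package prefix quoted in the stack trace for the JDK-internal fork, and on the assumption that the Apache Xerces-J distribution (xercesImpl.jar) registers its factory under org.apache.xerces.* when placed on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;

final class WhichParser {

    /** Class name of the DocumentBuilderFactory that JAXP picks at runtime. */
    static String factoryClassName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** True when the name carries the JDK-internal package prefix from the stack trace. */
    static boolean isJdkInternalFork(String className) {
        return className.startsWith("com.sun.org.apache.xerces.internal.");
    }

    public static void main(String[] args) {
        String name = factoryClassName();
        System.out.println("JAXP DocumentBuilderFactory: " + name);
        System.out.println(isJdkInternalFork(name)
                ? "This is the fork bundled with the JDK."
                : "This is not the JDK-internal fork.");
    }
}
```

If the first line of output starts with com.sun.org.apache.xerces.internal, problems belong with the JDK vendor rather than with the Apache Xerces lists.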

<jd...@msn.com> wrote on 12/05/2011 11:10:44 PM:

> Hello,
>
> I deal with Japanese text quite a bit and was recently parsing a
> file that contained the Unicode character U+2000B
> (http://www.fileformat.info/info/unicode/char/2000B/index.htm) in a comment.
> This character appears to have caused a SAXParseException to be thrown:
>
> [Fatal Error] :484236:25: An invalid XML character (Unicode: 0xd840)
> was found in the comment.
> org.xml.sax.SAXParseException; lineNumber: 484236; columnNumber: 25;
> An invalid XML character (Unicode: 0xd840) was found in the comment.
>  at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse
> (DOMParser.java:254)
>  at
> com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse
> (DocumentBuilderImpl.java:300)
>
> In this particular case, I was attempting to parse Jim Breen's
> publicly available Kanji dictionary file. This file is used quite
> extensively in many Japanese/English open-source dictionaries. I
> exchanged a few emails with Jim and he is confident that the XML is
> valid. I've reviewed the "Characters" section of the W3C XML 1.0 spec (
> http://www.w3.org/TR/2004/REC-xml-20040204/#charsets) and honestly
> can not tell for certain if U+2000B is valid in a comment.
>
> Basically Jim's file has an entry for each kanji and a comment prior
> to each entry that looks like this:
>
> <!-- Entry for Kanji: X -->
>
> where X is the actual character. If I remove all such comments, the
> file parses fine.
>
> If you are interested in checking out the file, it can be downloaded
> in GZIP format from Jim Breen's site.
>
> Info Page: http://www.csse.monash.edu.au/~jwb/kanjidic2/
> XML File: http://www.csse.monash.edu.au/~jwb/kanjidic2/kanjidic2.xml.gz
>
> As a side note, I was able to successfully parse this file with
> Apache Xerces Perl.
>
> Thank you for your time.
>
> Best Regards,
> Rick Noelle
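
For anyone who wants to test a given parser against the comment format quoted above, a small reproduction can be sketched as follows. This is an assumption-laden example: the element names are placeholders rather than a faithful copy of the dictionary file, the class and method names are made up, and no particular outcome is claimed. The original report was against a very large file, so a short document like this one may well parse cleanly even on an affected JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

final class CommentRepro {

    /** Small document with U+2000B inside a comment (element names are placeholders). */
    static String sampleXml() {
        String kanji = new String(Character.toChars(0x2000B));
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<kanjidic2>\n"
                + "<!-- Entry for Kanji: " + kanji + " -->\n"
                + "<character/>\n"
                + "</kanjidic2>\n";
    }

    /** Returns null if the document parses, otherwise the parser's error message. */
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory document with default settings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String error = tryParse(sampleXml());
        System.out.println(error == null ? "parsed without error" : "parse error: " + error);
    }
}
```

Running the same class once with the JDK's bundled parser and once with Apache Xerces-J on the classpath gives a direct comparison of the two implementations.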
