Posted to general@xml.apache.org by George Armhold <ar...@cs.rutgers.edu> on 2001/09/06 20:15:44 UTC
bug in Crimson: parsing UTF-8 chars in DTD comment fields
Hi,
I'd like to report what I think is a bug in Crimson (as obtained with
Sun's JAXP 1.1 reference implementation.) I'm fairly new to XML, and
I may be off-base here, so please bear with me. I'm trying to parse a
MusicXML document (see http://www.musicxml.org) and Crimson is giving
me
org.xml.sax.SAXParseException: Character conversion error: "Illegal
ASCII character, 0xc2" (line number may be too low).
when it encounters DTDs that have UTF-8 encoded characters in their
comments. In the case of MusicXML, the character is a two-byte
copyright symbol: © (bytes 0xC2 0xA9 in UTF-8). I believe that this
is correct UTF-8, and that it should be parsed correctly. MusicXML is
a complex hierarchy of DTDs, so I've boiled it down to a simple
example which I think demonstrates the problem. An example document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE simple PUBLIC
"-//Armhold//Simple DTD//EN"
"http://pablo.rutgers.edu/~armhold/dtds/simple.dtd">
<simple>
</simple>
The referenced simple.dtd contains the following:
<?xml version="1.0" encoding="UTF-8"?>
<!--
A really simple DTD.
Copyright © 2000-2001.
-->
When I try to parse this with Sun's example parser
(http://java.sun.com/xml/jaxp-1.1/docs/tutorial/dom/work/DomEcho01.java)
I get the following:
org.xml.sax.SAXParseException: Character conversion error: "Illegal
ASCII character, 0xc2" (line number may be too low).
    at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1038)
    at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1010)
    at org.apache.crimson.parser.InputEntity.peek(InputEntity.java:841)
    at org.apache.crimson.parser.Parser2.peek(Parser2.java:3000)
    at org.apache.crimson.parser.Parser2.maybeTextDecl(Parser2.java:2725)
    at org.apache.crimson.parser.Parser2.externalParameterEntity(Parser2.java:2806)
    at org.apache.crimson.parser.Parser2.maybeDoctypeDecl(Parser2.java:1155)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:489)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:433)
    at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:161)
    at DomEcho01.main(DomEcho01.java:63)
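The 0xc2 in the error message matches the first byte of the copyright
sign's UTF-8 encoding. A minimal sketch to check this, assuming a
modern JDK (java.nio.charset.StandardCharsets did not exist in JDK
1.3); the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class CopyrightBytes {
    /** Returns the UTF-8 encoding of s as space-separated lowercase hex. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to treat the signed byte as an unsigned value.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00A9 is the copyright sign; written as an escape so the
        // result does not depend on the source file's encoding.
        System.out.println(utf8Hex("\u00a9")); // prints "c2 a9"
    }
}
```

A decoder that treats the stream as 7-bit ASCII would reject 0xc2, the
first of those two bytes, which is consistent with the "Illegal ASCII
character" wording above.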
Removing the copyright characters from my DTD solves the problem. I'm
using JDK 1.3.0 with JAXP 1.1. Can someone please confirm this as a
bug, or enlighten me as to what I'm doing wrong?
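For anyone who wants to try the same input without network access,
here is a rough sketch that parses the example document and serves
the DTD text from memory through an EntityResolver. It is not
DomEcho01; the class name SimpleDtdRepro and the in-memory resolver
are assumptions for illustration, it needs Java 8 or later, and it
uses whichever JAXP parser is on the classpath rather than Crimson
specifically.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SimpleDtdRepro {
    // The example document from this message.
    static final String DOC =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE simple PUBLIC\n"
        + " \"-//Armhold//Simple DTD//EN\"\n"
        + " \"http://pablo.rutgers.edu/~armhold/dtds/simple.dtd\">\n"
        + "<simple>\n"
        + "</simple>\n";

    // The DTD text; \u00a9 is the copyright sign.
    static final String DTD =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!--\n"
        + "A really simple DTD.\n"
        + "Copyright \u00a9 2000-2001.\n"
        + "-->\n";

    static Document parse() throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Answer every external entity request with the in-memory DTD,
        // encoded as UTF-8 bytes, instead of fetching the system URL.
        builder.setEntityResolver((publicId, systemId) ->
            new InputSource(new ByteArrayInputStream(
                DTD.getBytes(StandardCharsets.UTF_8))));
        return builder.parse(new InputSource(new ByteArrayInputStream(
            DOC.getBytes(StandardCharsets.UTF_8))));
    }

    public static void main(String[] args) throws Exception {
        // Prints the root element name if the parse succeeds.
        System.out.println(parse().getDocumentElement().getTagName());
    }
}
```

A parser that decodes the DTD as UTF-8 should finish and report the
root element "simple"; a SAXParseException like the one above would
point to the same conversion problem.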
Thanks
--
George Armhold
Rutgers University
Bioinformatics Initiative
---------------------------------------------------------------------
In case of troubles, e-mail: webmaster@xml.apache.org
To unsubscribe, e-mail: general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org
Re: bug in Crimson: parsing UTF-8 chars in DTD comment fields
Posted by Edwin Goei <ed...@sun.com>.
George Armhold wrote:
>
> Hi,
>
> I'd like to report what I think is a bug in Crimson (as obtained with
> Sun's JAXP 1.1 reference implementation.) I'm fairly new to XML, and
> I may be off-base here, so please bear with me. I'm trying to parse a
> MusicXML document (see http://www.musicxml.org) and Crimson is giving
> me
>
> org.xml.sax.SAXParseException: Character conversion error: "Illegal
> ASCII character, 0xc2" (line number may be too low).
The JAXP RI version 1.1 has some serious bugs in it. See my JAXP FAQ
for more info at http://xml.apache.org/~edwingo/jaxp-faq.html. Could
you try your app with the latest version of Crimson? I just posted a
message announcing Crimson 1.1.2beta2.
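When switching parser versions it can help to confirm which JAXP
implementation is actually picked up at run time. A small sketch (the
class name WhichParser is hypothetical) that prints the factory
classes in use:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {
    /** Class name of the DOM factory that JAXP resolves to. */
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Class name of the SAX factory that JAXP resolves to. */
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryName());
        System.out.println("SAX factory: " + saxFactoryName());
    }
}
```

A class name under org.apache.crimson.jaxp, the package seen in the
stack trace, would indicate that Crimson is the parser in use; the
location of the jar it was loaded from is a separate question this
sketch does not answer.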
-Edwin