Posted to general@xml.apache.org by George Armhold <ar...@cs.rutgers.edu> on 2001/09/06 20:15:44 UTC

bug in Crimson: parsing UTF-8 chars in DTD comment fields

Hi,

I'd like to report what I think is a bug in Crimson (as obtained with
Sun's JAXP 1.1 reference implementation).  I'm fairly new to XML, and
I may be off-base here, so please bear with me.  I'm trying to parse a
MusicXML document (see http://www.musicxml.org) and Crimson is giving
me

    org.xml.sax.SAXParseException: Character conversion error: "Illegal ASCII character, 0xc2" (line number may be too low).

when it encounters DTDs that have UTF-8 encoded characters in their
comments.  In the case of MusicXML, the character is a two-byte
copyright symbol: ©.  I believe that this is correct UTF-8, and that
it should be parsed correctly.  MusicXML is a complex hierarchy of
DTDs, so I've boiled it down to a simple example which I think
demonstrates the problem.  An example document:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE simple PUBLIC
        "-//Armhold//Simple DTD//EN"
        "http://pablo.rutgers.edu/~armhold/dtds/simple.dtd">
    
    <simple>
    </simple>

The referenced simple.dtd contains the following:

    <?xml version="1.0" encoding="UTF-8"?>
    <!--
          A really simple DTD.
          Copyright © 2000-2001.
    -->
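
For reference, the © in the DTD comment above is U+00A9, which UTF-8
encodes as the two bytes 0xC2 0xA9.  The following is a minimal sketch
that checks this and shows that a strict US-ASCII decoder rejects the
leading 0xC2 byte, which is consistent with the wording of the error
message.  The class and method names (CopyrightBytes, utf8Hex,
decodesAsAscii) are made up for illustration, and StandardCharsets
needs Java 7 or later, which is newer than the JDK 1.3.0 used in this
report.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class CopyrightBytes {
    // Returns the UTF-8 encoding of s as space-separated lowercase hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Returns true only if every byte is a valid US-ASCII character.
    // A fresh decoder reports malformed input by throwing.
    static boolean decodesAsAscii(byte[] bytes) {
        try {
            StandardCharsets.US_ASCII.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String copyright = "\u00a9"; // U+00A9 COPYRIGHT SIGN
        System.out.println(utf8Hex(copyright)); // prints "c2 a9"
        // The same two bytes are not legal US-ASCII, so this prints "false".
        System.out.println(decodesAsAscii(copyright.getBytes(StandardCharsets.UTF_8)));
    }
}
```

So the DTD is valid UTF-8; the 0xc2 in the error is simply the first
byte of the copyright sign being read with an ASCII decoder.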

When I try to parse this with Sun's example parser
(http://java.sun.com/xml/jaxp-1.1/docs/tutorial/dom/work/DomEcho01.java)
I get the following:

    org.xml.sax.SAXParseException: Character conversion error: "Illegal ASCII character, 0xc2" (line number may be too low).
        at org.apache.crimson.parser.InputEntity.fatal(InputEntity.java:1038)
        at org.apache.crimson.parser.InputEntity.fillbuf(InputEntity.java:1010)
        at org.apache.crimson.parser.InputEntity.peek(InputEntity.java:841)
        at org.apache.crimson.parser.Parser2.peek(Parser2.java:3000)
        at org.apache.crimson.parser.Parser2.maybeTextDecl(Parser2.java:2725)
        at org.apache.crimson.parser.Parser2.externalParameterEntity(Parser2.java:2806)
        at org.apache.crimson.parser.Parser2.maybeDoctypeDecl(Parser2.java:1155)
        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:489)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:305)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:433)
        at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:185)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:161)
        at DomEcho01.main(DomEcho01.java:63)


Removing the copyright characters from my DTD solves the problem.  I'm
using JDK 1.3.0 with JAXP 1.1.  Can someone please confirm this as a
bug, or enlighten me as to what I'm doing wrong?
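
The test case above can also be run without the network.  The sketch
below is one way to do that, not the DomEcho01 program from the
tutorial: it embeds the document and the DTD as strings, encodes both
as UTF-8, and hands the DTD to the parser through an EntityResolver.
The class name SimpleDtdRepro and the method parseRootName are made up
for illustration, and the lambda needs Java 8 or later.  Whether it
succeeds depends on which JAXP parser is on the classpath; with the
Crimson build described in this report it would be expected to fail
with the SAXParseException shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SimpleDtdRepro {
    // The example document from the report.
    static final String DOC =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE simple PUBLIC\n"
        + "    \"-//Armhold//Simple DTD//EN\"\n"
        + "    \"http://pablo.rutgers.edu/~armhold/dtds/simple.dtd\">\n"
        + "\n"
        + "<simple>\n"
        + "</simple>\n";

    // The DTD from the report; \u00a9 is the copyright sign in the comment.
    static final String DTD =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!--\n"
        + "      A really simple DTD.\n"
        + "      Copyright \u00a9 2000-2001.\n"
        + "-->\n";

    // Parses DOC, serving DTD from memory, and returns the root element name.
    static String parseRootName() throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Answer every external entity request with the in-memory DTD,
        // encoded as UTF-8 bytes so the parser has to decode 0xC2 0xA9 itself.
        builder.setEntityResolver((publicId, systemId) -> {
            InputSource dtd = new InputSource(
                new ByteArrayInputStream(DTD.getBytes(StandardCharsets.UTF_8)));
            dtd.setSystemId(systemId);
            return dtd;
        });

        Document doc = builder.parse(new InputSource(
            new ByteArrayInputStream(DOC.getBytes(StandardCharsets.UTF_8))));
        return doc.getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        // Prints the root element name if the parser accepts the DTD;
        // otherwise the SAXParseException propagates.
        System.out.println(parseRootName());
    }
}
```

Passing bytes rather than a Reader matters here: a character stream
would bypass the parser's own byte-to-character conversion, which is
where the reported error is raised (InputEntity.fillbuf).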

Thanks

--
George Armhold
Rutgers University
Bioinformatics Initiative

---------------------------------------------------------------------
In case of troubles, e-mail:     webmaster@xml.apache.org
To unsubscribe, e-mail:          general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org


Re: bug in Crimson: parsing UTF-8 chars in DTD comment fields

Posted by Edwin Goei <ed...@sun.com>.
George Armhold wrote:
> 
> Hi,
> 
> I'd like to report what I think is a bug in Crimson (as obtained with
> Sun's JAXP 1.1 reference implementation.)  I'm fairly new to XML, and
> I may be off-base here, so please bear with me.  I'm trying to parse a
> MusicXML document (see http://www.musicxml.org) and Crimson is giving
> me
> 
> org.xml.sax.SAXParseException: Character conversion error: "Illegal
> ASCII character, 0xc2" (line number may be too low).

Version 1.1 of the RI (reference implementation) has some serious
bugs in it.  See my JAXP FAQ for more info at
http://xml.apache.org/~edwingo/jaxp-faq.html.  Could you try your app
with the latest version of Crimson?  I just posted a message
announcing Crimson 1.1.2beta2.

-Edwin
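
When switching parser versions it helps to confirm which
implementation JAXP actually loads.  A minimal sketch, assuming only
the standard javax.xml.parsers factory API; the class name WhichParser
and its methods are made up for illustration.  With Crimson on the
classpath the printed names would be expected to fall under
org.apache.crimson, the package seen in the stack trace.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {
    // Name of the DOM factory implementation that JAXP selected.
    static String domFactoryClass() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    // Name of the SAX factory implementation that JAXP selected.
    static String saxFactoryClass() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("DOM factory: " + domFactoryClass());
        System.out.println("SAX factory: " + saxFactoryClass());
    }
}
```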
