Posted to users@cocoon.apache.org by Anders Conrad <ac...@dsl.dk> on 2001/06/28 12:10:04 UTC

Parsing problem

Hello.

After successfully installing Cocoon 1.8.2, I encounter parsing errors when trying to transform XML documents into HTML using the Cocoon XSLT processor. The documents are encoded in UTF-8, and the problem is caused by element names containing the Danish letters æ, ø and å (ae, oe, aa). The same letters in regular text are parsed correctly; the problem occurs only in element names. The documents in question have been parsed without problems in XML Spy and James Clark's SP.

Has anyone else encountered this problem? Of course, an obvious solution is to avoid non-English characters in element names, but this may require a large amount of filtering of existing texts, changes to DTDs, etc. To the best of my knowledge, non-English characters should be allowed in XML names.
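
As a quick check of that last point, here is a minimal Java sketch that tests the three Danish letters against the Latin-1 portion of the BaseChar production in the XML 1.0 Recommendation (BaseChar is part of Letter, which may start a Name). The class and method names are hypothetical and are not part of Cocoon or Xerces; only the ASCII and Latin-1 ranges are covered.

```java
// Hypothetical helper: checks a character against the ASCII and Latin-1
// ranges of the XML 1.0 BaseChar production (letters allowed in names).
final class XmlNameLetterCheck {

    static boolean isLatin1BaseChar(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= 0x00C0 && c <= 0x00D6)   // excludes U+00D7 (multiplication sign)
            || (c >= 0x00D8 && c <= 0x00F6)   // excludes U+00F7 (division sign)
            || (c >= 0x00F8 && c <= 0x00FF);
    }

    public static void main(String[] args) {
        // U+00E6 = ae ligature, U+00F8 = o with stroke, U+00E5 = a with ring
        char[] danish = { '\u00e6', '\u00f8', '\u00e5' };
        for (char c : danish) {
            System.out.println("U+00" + Integer.toHexString(c).toUpperCase()
                + " allowed in XML names: " + isLatin1BaseChar(c));
        }
    }
}
```

If all three letters pass, the names in the DTD are legal as far as the specification goes, and the error would have to come from how the parser reads the bytes rather than from the names themselves.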

The platform is: Suse Linux 7.1, Apache 1.3.14, Tomcat 3.2.2, JDK 1.1.8.

The error stack is the following:
org.xml.sax.SAXException: A ')' is required in the declaration of element type "simpledoc". [FATAL ERROR] [File: "file:/var/jakarta-tomcat-3.2.2/webapps/cocoon/diplo/charbug.dtd" Line: 3 Column: 24] (nested exception: org.xml.sax.SAXParseException: A ')' is required in the declaration of element type "simpledoc". )
	at org.apache.cocoon.parser.AbstractParser.fatalError(AbstractParser.java:105)
	at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1037)
	at org.apache.xerces.framework.XMLDTDScanner.reportFatalXMLError(XMLDTDScanner.java:654)
	at org.apache.xerces.framework.XMLDTDScanner.scanChildren(XMLDTDScanner.java:1979)
	at org.apache.xerces.framework.XMLDTDScanner.scanElementDecl(XMLDTDScanner.java:1771)
	at org.apache.xerces.framework.XMLDTDScanner.scanDecls(XMLDTDScanner.java:1436)
	at org.apache.xerces.framework.XMLDocumentScanner.scanDoctypeDecl(XMLDocumentScanner.java:2179)
	at org.apache.xerces.framework.XMLDocumentScanner.access$0(XMLDocumentScanner.java:2133)
	at org.apache.xerces.framework.XMLDocumentScanner$PrologDispatcher.dispatch(XMLDocumentScanner.java:882)
	at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:380)
	at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:900)
	at org.apache.cocoon.parser.XercesParser.parse(XercesParser.java:85)
	at org.apache.cocoon.parser.AbstractParser.parse(AbstractParser.java:83)
	at org.apache.cocoon.producer.ProducerFromFile.getDocument(ProducerFromFile.java:78)
	at org.apache.cocoon.Engine.handle(Engine.java:359)
	at org.apache.cocoon.Cocoon.service(Cocoon.java:183)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
	at org.apache.tomcat.core.ServletWrapper.doService(ServletWrapper.java:405)
	at org.apache.tomcat.core.Handler.service(Handler.java:287)
	at org.apache.tomcat.core.ServletWrapper.service(ServletWrapper.java:372)
	at org.apache.tomcat.core.ContextManager.internalService(ContextManager.java:797)
	at org.apache.tomcat.core.ContextManager.service(ContextManager.java:743)
	at org.apache.tomcat.service.connector.Ajp13ConnectionHandler.processConnection(Ajp13ConnectionHandler.java:160)
	at org.apache.tomcat.service.TcpWorkerThread.runIt(PoolTcpEndpoint.java:416)
	at org.apache.tomcat.util.ThreadPool$ControlRunnable.run(ThreadPool.java:501)
	at java.lang.Thread.run(Thread.java)

The DTD in question looks like this:

<?xml version="1.0" encoding="UTF-8"?>

<!-- edited with XML Spy v3.5 NT (http://www.xmlspy.com) by Anders Conrad (DSL) -->

<!ELEMENT simpledoc (række)>

<!ELEMENT række (#PCDATA)>

A workaround would be to change it to the following (with the corresponding change in the test document):

<?xml version="1.0" encoding="UTF-8"?>

<!-- edited with XML Spy v3.5 NT (http://www.xmlspy.com) by Anders Conrad (DSL) -->

<!ELEMENT simpledoc (raekke)>

<!ELEMENT raekke (#PCDATA)>
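
To see whether the original DTD is rejected outside Cocoon as well, the declarations can be fed to a parser directly. The following is only a sketch under stated assumptions: it uses the JAXP API (javax.xml.parsers) bundled with current JDKs, not the Xerces version shipped with Cocoon 1.8.2 on JDK 1.1.8, and it places the two declarations in an internal DTD subset instead of the external charbug.dtd file. The class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical standalone check: parse a UTF-8 document whose DTD and
// content use the element name "række" (written with a \u00e6 escape so the
// source file itself stays pure ASCII).
final class DanishNameParseCheck {

    static String firstChildName() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE simpledoc [\n"
            + "<!ELEMENT simpledoc (r\u00e6kke)>\n"
            + "<!ELEMENT r\u00e6kke (#PCDATA)>\n"
            + "]>\n"
            + "<simpledoc><r\u00e6kke>test</r\u00e6kke></simpledoc>";
        try {
            // Encode explicitly as UTF-8 so the parser sees the same bytes
            // it would read from a UTF-8 file on disk.
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(utf8));
            return doc.getDocumentElement().getFirstChild().getNodeName();
        } catch (Exception e) {
            // A SAXParseException here would mirror the error in the stack trace.
            throw new IllegalStateException("parse failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Parsed child element: " + firstChildName());
    }
}
```

If this parses cleanly while the Cocoon setup still fails, that would point at the parser or JDK version in the servlet environment, not at the DTD.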

I have the complete reproducible test case available in case someone is interested.

Any suggestions or commentary would be welcome!

Anders
 
Anders Conrad                  Det Danske Sprog- og Litteraturselskab
IT-redaktør, cand.mag.       Christians Brygge 1
E-mail: ac@dsl.dk             1219 København K
                                        Tlf. 33 13 06 60