Posted to j-dev@xerces.apache.org by Lisa Retief <li...@exinet.co.za> on 2000/06/23 15:01:22 UTC

Valid document breaks parser?

Hi all,

Attached is an XHTML document (TOChapter12.htm) which is well-formed and
valid (it validates using various validating XML Editors). However, when we
parse it using Xerces 1.0.3, 1.0.4 or 1.1.1 it breaks with the following
exception:

org.xml.sax.SAXParseException: The element type "font" must be terminated by the matching end-tag "</font>".
 at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:925)
 at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:634)
 at org.apache.xerces.framework.XMLDocumentScanner.abortMarkup(XMLDocumentScanner.java:683)
 at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1187)
 at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:380)
 at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:817)
 at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:856)
 at exinet.biscotti.content.publish.Test.main(Test.java:30)

The weird thing is that if I remove the namespace declarations from the
document element, it then parses without breaking.

The attached ZIP file contains the file along with the necessary DTD and
.ent files.

Does anyone have an idea what is going wrong here?

Please note that this XHTML DTD is a modified version of the public one - it
has been necessary to do this in order to change all references of the
"xml:space" attribute from
xml:space   (preserve)    #FIXED 'preserve'
to
xml:space   (default|preserve)     #FIXED 'preserve'.

The original DTD did not conform to the XML 1.0 spec (section 2.10) and
correctly broke Xerces.

Regards, Lisa Retief

Re: Valid document breaks parser?

Posted by Norman Walsh <nd...@nwalsh.com>.

/ lisa@exinet.co.za (Lisa Retief) was heard to say:
| Attached is an XHTML document (TOChapter12.htm) which is well-formed and
| valid (it validates using various validating XML Editors). However, when we
| parse it using Xerces 1.0.3, 1.0.4 or 1.1.1 it breaks with the following
| exception:

I think the problem is that your document has no encoding declaration
(so UTF-8 is probably assumed?) and contains literal, 8-bit
non-breaking space characters. Change them to &#160;'s or specify
the right encoding and I bet the problem goes away.

                                        Be seeing you,
                                          norm

-- 
Norman Walsh <nd...@nwalsh.com> | A proof tells us where to concentrate
http://nwalsh.com/            | our doubts.--Anonymous
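
Norman's suggestion can be checked without a parser. The sketch below is
illustrative only: the class and method names are made up for this example,
and it assumes the document really does contain a literal 0xA0 byte, which
the thread does not confirm. It shows that a lone 0xA0 byte is malformed
when decoded strictly as UTF-8, while the UTF-8 encoding of U+00A0 and the
character reference &#160; both decode cleanly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces.
class Utf8Check {
    /** Returns true if the bytes decode as strict UTF-8 with no malformed sequences. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Literal non-breaking space written as a single ISO-8859-1 byte (0xA0).
        byte[] latin1 = "<font>a\u00A0b</font>".getBytes(StandardCharsets.ISO_8859_1);
        // The same character written as UTF-8 (0xC2 0xA0).
        byte[] utf8 = "<font>a\u00A0b</font>".getBytes(StandardCharsets.UTF_8);
        // The character reference form, which is plain ASCII.
        byte[] escaped = "<font>a&#160;b</font>".getBytes(StandardCharsets.US_ASCII);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1));
        System.out.println("UTF-8 bytes valid as UTF-8:      " + isValidUtf8(utf8));
        System.out.println("&#160; form valid as UTF-8:      " + isValidUtf8(escaped));
    }
}
```

Running the same check over the bytes of the failing file would show whether
the encoding is a plausible cause before looking any deeper into the parser.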

Re: Valid document breaks parser - definitely starting to look like a bug.

Posted by Lisa Retief <li...@exinet.co.za>.

We have managed to determine that this problem happens when reparsing a
document produced by the Xerces serializer. We ran a simple test which
parsed the document, serialized it, and then parsed it again. The second
parse breaks with the same "font" end-tag error. Below are the test code and
stack trace. Attached are the document and the resources it uses.

// First parse: read the original document into a DOM.
File resultFile = new File("/java/test1/OChapter12.htm");
DOMParser parser = new DOMParser();
parser.setEntityResolver(new CustomEntityResolver());
parser.parse(resultFile.getAbsolutePath());
Document document = parser.getDocument();
System.out.println("here");

// Serialize the DOM to a second file.
File file2 = new File("/java/test1/OChapter121.htm");
PrintWriter out = new PrintWriter(new FileWriter(file2.getAbsolutePath(), false));
OutputFormat format = new OutputFormat(document, null, true);
format.setPreserveSpace(true);
XMLSerializer serializer = new XMLSerializer(out, format);
serializer.asDOMSerializer().serialize(document);

out.close();
System.out.println("here2");

// Second parse: read the serialized file back in. This is the parse that fails.
DOMParser parser2 = new DOMParser();
parser2.setEntityResolver(new CustomEntityResolver());
parser2.parse(file2.getAbsolutePath());
Document document2 = parser2.getDocument();
System.out.println("here3");

Stack trace as follows:

org.xml.sax.SAXParseException: The element type "font" must be terminated by the matching end-tag "</font>".
 at org.apache.xerces.framework.XMLParser.reportError(XMLParser.java:1318)
 at org.apache.xerces.framework.XMLDocumentScanner.reportFatalXMLError(XMLDocumentScanner.java:625)
 at org.apache.xerces.framework.XMLDocumentScanner.abortMarkup(XMLDocumentScanner.java:674)
 at org.apache.xerces.framework.XMLDocumentScanner$ContentDispatcher.dispatch(XMLDocumentScanner.java:1176)
 at org.apache.xerces.framework.XMLDocumentScanner.parseSome(XMLDocumentScanner.java:381)
 at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1208)
 at org.apache.xerces.framework.XMLParser.parse(XMLParser.java:1247)
 at exinet.biscotti.content.publish.Test.main(Test.java:45)
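
One possible contributor, not confirmed anywhere in this thread: the test
writes the serialized output through a FileWriter, which encodes characters
with the platform default charset. If that default is a single-byte encoding
and the file is then read back as UTF-8, any non-ASCII character such as
U+00A0 would be stored as a byte sequence the parser cannot decode. The
stdlib-only sketch below uses a hypothetical class name and an in-memory
stream instead of the real files. It shows how the same character comes out
as different bytes depending on the charset the Writer is bound to.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Xerces or of the original test.
class WriterEncodingDemo {
    /** Writes text through a Writer bound to the given charset and returns the raw bytes. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            out.write(text);
        } catch (IOException e) {
            // An in-memory stream should not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // non-breaking space

        byte[] asLatin1 = write(nbsp, StandardCharsets.ISO_8859_1);
        byte[] asUtf8 = write(nbsp, StandardCharsets.UTF_8);

        // Print each encoding as hex so the difference is visible.
        for (byte[] encoded : new byte[][] {asLatin1, asUtf8}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encoded) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(hex.toString().trim());
        }
    }
}
```

In the test above, the analogous change would be to wrap a FileOutputStream
in an OutputStreamWriter with an explicit charset that matches the encoding
declared in the serialized document. Whether that resolves this particular
failure is untested here.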




Re: Valid document breaks parser?

Posted by Lisa Retief <li...@exinet.co.za>.

Lisa Retief wrote:

> The weird thing is that if I remove the namespace declarations from the
> document element, it then parses without breaking.

Please ignore this part of my previous post. I was trusting the report from
one of my developers, but when I ran the test myself, removing the namespace
declarations did not make the problem go away. It is still a pretty weird
problem though...

Lisa