Posted to j-users@xerces.apache.org by Mike O'Leary <tm...@comcast.net> on 2007/04/18 10:05:55 UTC

Ignoring missing end tag errors

I wrote an XML parser using the SAXParser. It turns out that the XML files I
need to parse are somewhat noisy, and in some cases a start tag has no
matching end tag. I would like to catch these errors immediately and proceed
as if the end tag had been read (in the cases I have looked at, the missing
end tag causes no ambiguity), but I don't see how to do that. The
documentation for the DefaultHandler class, which I am using to define my
handler functions, says that it supports the functions error, fatalError and
warning, but when my parser hits a place where an end tag is missing, the
underlying parser code throws an exception instead of calling any of these
functions, and I don't see how to catch that exception in a way that would
allow the parser to continue reading the XML file. The error message and
call stack look like this:

 

org.xml.sax.SAXParseException: The element type "P" must be terminated by the matching end-tag "</P>".
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:236)
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:215)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:386)
        at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:316)
        at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1438)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1219)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)

Is there a way to define an error handler that ignores certain kinds of
errors and to have it be used instead of, say, the ErrorHandlerWrapper in
this call stack whose fatalError function creates and throws an exception in
all cases? Is it reasonable to want to do this, given that the parser
considers an error of this kind to be fatal?

 


Re: Ignoring missing end tag errors

Posted by ke...@us.ibm.com.
If end tags are missing, the data simply isn't XML and you shouldn't expect
XML tools to handle it. Part of the point of moving from SGML to XML was
precisely to drive folks toward writing well-formed documents rather than
trying to guess past their errors.
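
To make that concrete, here is a minimal sketch, assuming the parser bundled
with the JDK and its default settings. The class names FatalErrorDemo and
RecordingHandler and the sample input are made up for illustration.
DefaultHandler's own fatalError rethrows the exception it is given, which may
be what the call stack above is showing. The sketch overrides fatalError so
that it only records the error, and then checks what the parser does next.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class FatalErrorDemo {
    /** Hypothetical handler: records fatal errors instead of rethrowing them. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> fatalErrors = new ArrayList<>();
        final List<String> endedElements = new ArrayList<>();
        boolean parseAborted = false;

        @Override
        public void endElement(String uri, String localName, String qName) {
            endedElements.add(qName);
        }

        @Override
        public void fatalError(SAXParseException e) {
            // DefaultHandler's version throws e; this one only records it.
            fatalErrors.add(e.getMessage());
        }
    }

    /** Parses the given text and returns the handler so callers can inspect it. */
    static RecordingHandler parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RecordingHandler handler = new RecordingHandler();
        try {
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Reached if the parser gives up even though fatalError returned.
            handler.parseAborted = true;
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Sample input with a missing </P> end tag.
        RecordingHandler h = parse("<doc><P>text</doc>");
        System.out.println("fatal errors reported: " + h.fatalErrors);
        System.out.println("end tags delivered:    " + h.endedElements);
        System.out.println("parse aborted:         " + h.parseAborted);
    }
}
```

So the handler can observe the error, but swallowing it in fatalError should
not be expected to make the parser carry on as if the end tag had been there.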

If it's HTML (which is based on SGML), you may want to look at the NekoHTML
parser, which tolerates some sloppiness, or the W3C's "Tidy" tool, which
attempts to recover from much more.

If it's something else... Frankly, the right answer here is to fix the file
before it reaches the XML parser, either by fixing whatever generates it or
by running a preprocessor of some sort.
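
A preprocessor of that kind could be sketched roughly as follows. This is
only an illustration under strong assumptions, not a general repair tool: the
class EndTagRepair and its repair method are hypothetical, the input is
assumed to contain no comments, CDATA sections or processing instructions, no
attribute value is assumed to contain '<' or '>', and a missing end tag is
assumed to belong immediately before the next end tag that closes an
enclosing element. It keeps a stack of open elements and inserts the end tags
that are missing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndTagRepair {
    // Deliberately naive: matches start tags, end tags and empty-element tags.
    // Groups: 1 = "/" for an end tag, 2 = element name, 3 = attributes,
    // 4 = "/" for an empty-element tag such as <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)([^<>]*?)(/?)>");

    /** Returns the input with missing end tags inserted (see assumptions above). */
    static String repair(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // innermost open element first
        Matcher m = TAG.matcher(input);
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start());     // copy text between tags
            pos = m.end();
            String name = m.group(2);
            boolean isEndTag = !m.group(1).isEmpty();
            boolean isEmptyElement = !m.group(4).isEmpty();
            if (isEndTag) {
                if (open.contains(name)) {
                    // Close every element left open inside the one being ended.
                    while (!open.peek().equals(name)) {
                        out.append("</").append(open.pop()).append('>');
                    }
                    open.pop();
                    out.append(m.group());
                }
                // An end tag with no matching open element is dropped.
            } else {
                if (!isEmptyElement) {
                    open.push(name);
                }
                out.append(m.group());
            }
        }
        out.append(input, pos, input.length());
        // Close anything still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(repair("<doc><P>text</doc>"));
        // prints <doc><P>text</P></doc>
    }
}
```

The repaired text can then be handed to the SAX parser as usual. Whether the
guessed position of each inserted end tag is actually right depends entirely
on the data, which is why fixing whatever generates the file is the safer
route.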

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)