Posted to general@xml.apache.org by David Kellum <de...@aol.com> on 2001/02/23 20:34:43 UTC

Differences in DTD handling between Xerces-J and Crimson

I'm writing a performance-minded server in Java that needs to repeatedly
parse relatively small (5k) XML documents obtained from a remote
server.  I don't need or want to have the overhead of any validation in
this parse.  However, the returned document includes a doctype
declaration like so:

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
<!DOCTYPE FOO SYSTEM "foo.dtd">
<FOO>
...
</FOO>

I don't control this format so I can't get rid of the DOCTYPE
declaration without performing some error-prone hack like stripping the
line out before passing it to the parser. 

With Xerces-J I can work around this by passing an instance of the
following class to the SAX2 XMLReader.setEntityResolver() method:

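import java.io.ByteArrayInputStream;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Answers every external entity request with an empty stream,
// so the parser never tries to fetch the DTD.
public class NullEntityResolver
    implements EntityResolver
{
    public InputSource resolveEntity( String publicId, String systemId )
    {
        return new InputSource( new ByteArrayInputStream( new byte[0] ) );
    }
}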

However, I can't seem to do the same with the Crimson 1.1 parser.  Here
I get the following:

Exception in thread "main" org.xml.sax.SAXParseException: Relative URI
"foo.dtd"; can not be resolved without a document URI.
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3035)
    at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3029)
    at org.apache.crimson.parser.Parser2.parseSystemId(Parser2.java:2627)
    at org.apache.crimson.parser.Parser2.maybeExternalID(Parser2.java:2605)
    at org.apache.crimson.parser.Parser2.maybeDoctypeDecl(Parser2.java:1116)
    at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:488)
    at org.apache.crimson.parser.Parser2.parse(Parser2.java:304)
    at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:433)
    at TestParse.main(TestParse.java:39)


Why do I want to use Crimson, you ask?  For this document type I'm seeing
it perform twice as fast as Xerces-J 1.3 (after the Java 1.3 -server VM
has had time to warm up).

Any suggestions on how I might work around this with Crimson?  Any
comments on the validity of my approach for dealing with this in
Xerces-J?

Thanks,
David

Re: Differences in DTD handling between Xerces-J and Crimson

Posted by Arnaud Le Hors <le...@us.ibm.com>.
David Kellum wrote:
> 
> I'm writing a performance minded server in Java that needs to repeatedly
> parse relatively small (5k) XML documents obtained from a remote
> server.  I don't need or want to have the overhead of any validation in
> this parse.  However, the returned document includes a doctype
> declaration like so:

FYI, some time ago I added a feature to Xerces that lets you simply
turn DTD loading off. Look for 'load-external-dtd' in the list of
features.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group
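
For reference, a minimal sketch of switching that feature off through the SAX2 XMLReader.setFeature() call. The full feature identifier below is the one listed in the Xerces feature documentation; the class and method names are made up for illustration, and the sketch assumes the JAXP factory hands back a Xerces-derived parser (other parsers may reject the feature with a SAXNotRecognizedException).

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of Xerces or of the original post.
final class LoadExternalDtdDemo {
    // Xerces-specific feature identifier; not a standard SAX2 feature.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Returns a non-validating reader that skips the external DTD subset.
    static XMLReader newReaderWithoutDtdLoading()
            throws ParserConfigurationException, SAXException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Throws SAXNotRecognizedException if the parser lacks this feature.
        reader.setFeature(LOAD_EXTERNAL_DTD, false);
        return reader;
    }
}
```

With the feature off, a document carrying <!DOCTYPE FOO SYSTEM "foo.dtd"> should parse without the parser ever opening foo.dtd, so no entity resolver is needed for that case.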

Re: Differences in DTD handling between Xerces-J and Crimson

Posted by David Kellum <de...@aol.com>.
Edwin Goei wrote:
> Try this, in the code that starts the parse, there is a SAX InputSource
> object representing the main document.  Use InputSource.setSystemId() to
> set some URI on the main document.  I believe an empty string ("")
> should work as well.  HTH,

Thanks!  This did the trick and now I'm seeing better performance with
Crimson.  This also doesn't hurt Xerces, so my code can work with either
of these parsers.

--David

Re: Differences in DTD handling between Xerces-J and Crimson

Posted by Edwin Goei <Ed...@eng.sun.com>.
David Kellum wrote:
> 
> I'm writing a performance minded server in Java that needs to repeatedly
> parse relatively small (5k) XML documents obtained from a remote
> server.  I don't need or want to have the overhead of any validation in
> this parse.  However, the returned document includes a doctype
> declaration like so:
> 
> <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>
> <!DOCTYPE FOO SYSTEM "foo.dtd">
> <FOO>
> ...
> </FOO>
> 
> I don't control this format so I can't get rid of the DOCTYPE
> declaration without performing some error-prone hack like stripping the
> line out before passing it to the parser.
> 
> With Xerces-J I can work around this by using the SAX-2 interface's
> XMLReader.setEntityResolver() to an instance of the following:
> 
> public class NullEntityResolver
>     implements EntityResolver
> {
>     public InputSource resolveEntity( String publicId, String systemId )
>     {
>        return new InputSource( new ByteArrayInputStream( new byte[0] ) );
>     }
> }
> 
> However, I can't seem to do the same with the Crimson 1.1 parser.  Here
> I get the following:
> 
> Exception in thread "main" org.xml.sax.SAXParseException: Relative URI
> "foo.dtd"; can not be resolved without a document URI.
>     at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3035)
>     at org.apache.crimson.parser.Parser2.fatal(Parser2.java:3029)
>     at org.apache.crimson.parser.Parser2.parseSystemId(Parser2.java:2627)
>
> Any suggestions on how I might work around this with Crimson?  Any
> comments on the validity of my approach for dealing with this in
> Xerces-J?

Sounds like a good approach.  Looking at the Crimson code, the parser
tries to resolve the system ID in the doctype declaration.  Since that
is a relative URI, it asks for the base URI of the document, which is
null, hence the exception.

Try this: in the code that starts the parse, there is a SAX InputSource
object representing the main document.  Use InputSource.setSystemId() to
set some URI on the main document.  I believe an empty string ("")
should work as well.  HTH,

-Edwin
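
The suggestion above can be sketched roughly as follows: give the InputSource a system ID so the relative "foo.dtd" has a base to resolve against, and keep the empty-stream entity resolver so the DTD is never actually fetched. The class name, method name, and the base URI used in the usage note are hypothetical; the parser is whatever the JAXP factory returns, so this is an illustration rather than Crimson-specific code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not taken from the thread.
final class SystemIdDemo {
    // Parses a document held in memory, supplying a base URI for it.
    static void parse(byte[] document, String baseUri, DefaultHandler handler)
            throws ParserConfigurationException, SAXException, IOException {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();

        // Same idea as NullEntityResolver: every external entity is empty.
        reader.setEntityResolver((publicId, systemId) ->
            new InputSource(new ByteArrayInputStream(new byte[0])));
        reader.setContentHandler(handler);

        InputSource source = new InputSource(new ByteArrayInputStream(document));
        // The base URI against which relative system IDs are resolved.
        source.setSystemId(baseUri);
        reader.parse(source);
    }
}
```

A call such as SystemIdDemo.parse(bytes, "http://example.invalid/docs/", handler) uses a placeholder base URI; since the resolver short-circuits every lookup, the URI only has to be well-formed, not reachable.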