Posted to j-dev@xerces.apache.org by Kurt Wiseman <ku...@bea.com> on 2000/06/09 18:19:26 UTC

Performance problem with reading DTD's

We have a fairly large DTD (4700 lines) and the amount of time it takes to 
read/parse it is surprisingly large.

In the performance FAQ it talks about not specifying a DTD unless you have 
to because the current parser reads it even if validation is off?

    Turn validation off if you don't need it. Validation is expensive. Also,
    avoid using a DOCTYPE line in your XML document. The current version of
    the parser will always read the DTD if the DOCTYPE line is specified even
    when not validating.

Is this considered a bug or feature and will it be changed soon?

The processing we're doing involves reading in the xml via the DOMParser 
and then looking for special tags that require further processing before 
the resulting document can be handed to the SAXParser.  Once the SAXParser 
is invoked, the DTD is required but it's *not* needed during the DOMParser 
"pre-processing".

Any hints on how to get things "quicker" will be greatly appreciated.

Thanks,
Kurt


Re: Performance problem with reading DTD's

Posted by Arnaud Le Hors <le...@jtcsv.com>.
Kurt Wiseman wrote:
> 
> Unfortunately there are a bunch of entity defs in the DTD...
> 
> I also tried removing all comments and whitespace but the gain in speed was
> nothing to write home about.
> 
> I think the best solution is to have a feature that can be turned on/off
> where DTD's are ignorable.  I'd turn it on in the "pre-process" phase and
> turn it off before handing the results to the SAXParser.

That seems to be the only way to go in the long term. While the XML spec
clearly defines the behavior of a validating parser, it leaves plenty of
room for implementing various behaviors in non-validating parsers. This
space between a minimally conformant parser and a fully validating parser
is often referred to as the "grey area". I personally think it sucks since
it leads to very poor interoperability, but we've got to live with it.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Technology Group

Re: Performance problem with reading DTD's

Posted by Mike Pogue <mp...@apache.org>.
If there are entity defs in the DTD, then they need to be read, even though validation is
off, right?

The only other optimization I can think of is: if validation is OFF, don't read in the
DTD, unless an entity reference is encountered.  If it is, then pause parsing, read in the
entire DTD, then resume parsing.

Mike
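
The lazy-loading idea above can be approximated outside the parser. What
follows is a minimal sketch, not anything Xerces provides: a hypothetical
helper, EntityReferenceScan.needsDtdForEntities, scans the raw document text
for general entity references other than the five predefined ones. If it
finds none, the application could leave the DOCTYPE line out before a
non-validating parse. The scan is purely textual and only recognizes ASCII
entity names, so it is a heuristic, and it says nothing about default
attributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xerces.
final class EntityReferenceScan {
    // The five entities every XML parser predefines; they need no DTD.
    private static final Set<String> PREDEFINED =
        new HashSet<>(Arrays.asList("lt", "gt", "amp", "apos", "quot"));

    // Matches general entity references such as &name; (ASCII names only).
    // Character references (&#10; or &#x0A;) do not match because '#'
    // cannot start a name.
    private static final Pattern ENTITY_REF =
        Pattern.compile("&([A-Za-z_:][A-Za-z0-9_:.-]*);");

    private EntityReferenceScan() {}

    // Returns true if the text references an entity that only a DTD could define.
    static boolean needsDtdForEntities(String xml) {
        Matcher m = ENTITY_REF.matcher(xml);
        while (m.find()) {
            if (!PREDEFINED.contains(m.group(1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(needsDtdForEntities("<a>1 &lt; 2</a>"));
        System.out.println(needsDtdForEntities("<a>&vendor; parser</a>"));
    }
}
```

Because the scan also matches references inside comments and CDATA sections,
it errs on the side of reading the DTD when in doubt.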

Kurt Wiseman wrote:
> 
> Unfortunately there are a bunch of entity defs in the DTD...
> 
> I also tried removing all comments and whitespace but the gain in speed was
> nothing to write home about.
> 
> I think the best solution is to have a feature that can be turned on/off
> where DTD's are ignorable.  I'd turn it on in the "pre-process" phase and
> turn it off before handing the results to the SAXParser.
> 
> Any other thoughts?
> 
> Thanks,
> Kurt
> 
> At 09:33 AM 6/9/00 -0700, you wrote:
> >The DTD needs to be read, even when validation is off, because it might
> >contain entity definitions.  If you don't use entity definitions at all,
> >you might want to just leave the DOCTYPE line out.
> >
> >Mike
> >
> >Kurt Wiseman wrote:
> > >
> > > We have a fairly large DTD (4700 lines) and the amount of time it takes
> > > to read/parse it is surprisingly large.
> > >
> > > In the performance FAQ it talks about not specifying a DTD unless you
> > > have to because the current parser reads it even if validation is off?
> > >
> > >           Turn validation off if you don't need it. Validation is
> > >           expensive. Also, avoid using a DOCTYPE line in your XML
> > >           document. The current version of the parser will always read
> > >           the DTD if the DOCTYPE line is specified even when not
> > >           validating.
> > >
> > > Is this considered a bug or feature and will it be changed soon?
> > >
> > > The processing we're doing involves reading in the xml via the
> > > DOMParser and then looking for special tags that require further
> > > processing before the resulting document can be handed to the
> > > SAXParser.  Once the SAXParser is invoked, the DTD is required but
> > > it's *not* needed during the DOMParser "pre-processing".
> > >
> > > Any hints on how to get things "quicker" will be greatly appreciated.
> > >
> > > Thanks,
> > > Kurt
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> >For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: Performance problem with reading DTD's

Posted by Kurt Wiseman <ku...@bea.com>.
Unfortunately there are a bunch of entity defs in the DTD...

I also tried removing all comments and whitespace but the gain in speed was 
nothing to write home about.

I think the best solution is to have a feature that can be turned on/off 
where DTD's are ignorable.  I'd turn it on in the "pre-process" phase and 
turn it off before handing the results to the SAXParser.

Any other thoughts?

Thanks,
Kurt
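
One way to approximate an "ignorable DTD" switch at the application level is
a SAX EntityResolver that hands the parser an empty external subset. The
sketch below is only an illustration under stated assumptions: the class name
EmptyDtdResolver, the preProcess helper, and the ".dtd" suffix test are all
hypothetical, and the parser is obtained through the standard JAXP
DocumentBuilderFactory rather than the Xerces DOMParser class directly.
Entities declared only in the skipped DTD will not be available, so this
suits a pass that does not depend on them.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical resolver intended for the pre-processing pass only.
final class EmptyDtdResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null && systemId.endsWith(".dtd")) {
            // Hand the parser an empty external subset instead of the real DTD.
            return new InputSource(new StringReader(""));
        }
        return null; // fall back to the parser's default resolution
    }

    // Non-validating DOM parse that never opens the external DTD.
    static Document preProcess(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setEntityResolver(new EmptyDtdResolver());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("pre-processing parse failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document; the DTD path does not need to exist.
        String xml = "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE order SYSTEM \"file:///no-such-dir/large.dtd\">\n"
            + "<order id=\"42\"/>";
        System.out.println(preProcess(xml).getDocumentElement().getTagName());
    }
}
```

The later validating pass would simply not install this resolver, so the real
DTD is read there as usual.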

At 09:33 AM 6/9/00 -0700, you wrote:
>The DTD needs to be read, even when validation is off, because it might
>contain entity definitions.  If you don't use entity definitions at all,
>you might want to just leave the DOCTYPE line out.
>
>Mike
>
>Kurt Wiseman wrote:
> >
> > We have a fairly large DTD (4700 lines) and the amount of time it takes
> > to read/parse it is surprisingly large.
> >
> > In the performance FAQ it talks about not specifying a DTD unless you
> > have to because the current parser reads it even if validation is off?
> >
> >           Turn validation off if you don't need it. Validation is
> >           expensive. Also, avoid using a DOCTYPE line in your XML
> >           document. The current version of the parser will always read
> >           the DTD if the DOCTYPE line is specified even when not
> >           validating.
> >
> > Is this considered a bug or feature and will it be changed soon?
> >
> > The processing we're doing involves reading in the xml via the
> > DOMParser and then looking for special tags that require further
> > processing before the resulting document can be handed to the
> > SAXParser.  Once the SAXParser is invoked, the DTD is required but
> > it's *not* needed during the DOMParser "pre-processing".
> >
> > Any hints on how to get things "quicker" will be greatly appreciated.
> >
> > Thanks,
> > Kurt



Re: Performance problem with reading DTD's

Posted by Mike Pogue <mp...@apache.org>.
The DTD needs to be read, even when validation is off, because it might contain
entity definitions.  If you don't use entity definitions at all, you might want
to just leave the DOCTYPE line out.

Mike
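
To illustrate why entity definitions force the DTD to be read, here is a
small sketch using the standard JAXP API with validation switched off. The
document, the entity name "vendor", and the class name EntityExpansionDemo
are made up for the example, and the entity is declared in the internal
subset so the snippet is self-contained. The parser can only replace
&vendor; because it has read the declaration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EntityExpansionDemo {
    // Hypothetical document: "vendor" is declared in the internal DTD subset.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (#PCDATA)>\n"
        + "  <!ENTITY vendor \"Apache\">\n"
        + "]>\n"
        + "<note>Parsed by &vendor; Xerces</note>";

    // Parses without validation and returns the root element's text.
    static String parseText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // validation off; the DTD is still read
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseText(XML));
    }
}
```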

Kurt Wiseman wrote:
> 
> We have a fairly large DTD (4700 lines) and the amount of time it takes to read/parse it
> is surprisingly large.
> 
> In the performance FAQ it talks about not specifying a DTD unless you have to because
> the current parser reads it even if validation is off?
> 
>           Turn validation off if you don't need it. Validation is expensive. Also, avoid
>           using a DOCTYPE line in your XML document. The current version of the
>           parser will always read the DTD if the DOCTYPE line is specified even
>           when not validating.
> 
> Is this considered a bug or feature and will it be changed soon?
> 
> The processing we're doing involves reading in the xml via the DOMParser and then
> looking for special tags that require further processing before the resulting document
> can be handed to the SAXParser.  Once the SAXParser is invoked, the DTD is required but
> it's *not* needed during the DOMParser "pre-processing".
> 
> Any hints on how to get things "quicker" will be greatly appreciated.
> 
> Thanks,
> Kurt

Re: Performance problem with reading DTD's

Posted by Andy Clark <an...@apache.org>.
Kurt Wiseman wrote:
> In the performance FAQ it talks about not specifying a DTD unless you
> have to because the current parser reads it even if validation is off?
> [...]
> Is this considered a bug or feature and will it be changed soon?

It could be considered both, actually. Some people want it because
they want their default attributes to appear and other attribute
values to be normalized. However, other people (like yourself) 
don't want the DTD to be read at all when validation is turned
off.
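
The default-attribute case can be seen with a short sketch against the
standard JAXP API. The element name "item", the attribute "status", and the
class name DefaultAttributeDemo are hypothetical, and the DTD is kept in the
internal subset so the example stands alone. With validation off, the
attribute still shows up when the declaration is present and is absent when
there is no DOCTYPE.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DefaultAttributeDemo {
    // Hypothetical document: the DTD gives "status" a default value.
    static final String WITH_DTD =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE item [\n"
        + "  <!ELEMENT item EMPTY>\n"
        + "  <!ATTLIST item status CDATA \"draft\">\n"
        + "]>\n"
        + "<item/>";

    // The same element with no DOCTYPE at all.
    static final String WITHOUT_DTD = "<item/>";

    // Returns the root element's "status" attribute, or "" if it has none.
    static String statusOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // non-validating parse
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("status");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("with DTD:    [" + statusOf(WITH_DTD) + "]");
        System.out.println("without DTD: [" + statusOf(WITHOUT_DTD) + "]");
    }
}
```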

Currently, the parser reads the DTD if a DOCTYPE line is present.
However, conceivably in the future we could provide a range of
validation options where one setting is not to read the grammar
at all. This, however, does not have priority and there is no
timeframe for such a feature.
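
For reference, later Xerces-J releases document a parser feature with the
identifier http://apache.org/xml/features/nonvalidating/load-external-dtd
that covers the "do not read the grammar" setting for external DTDs. The
sketch below assumes the parser behind JAXP recognizes that identifier, which
is not true of the parser version discussed in this thread; the class name
SkipExternalDtdDemo and the sample document are hypothetical. If the feature
is unknown, setFeature throws ParserConfigurationException, which the sketch
surfaces as an IllegalStateException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class SkipExternalDtdDemo {
    // Assumption: the underlying parser is a Xerces-derived one that
    // recognizes this feature identifier.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Hypothetical document; the referenced DTD file does not exist.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE order SYSTEM \"file:///no-such-dir/large.dtd\">\n"
        + "<order id=\"42\"/>";

    // Non-validating parse that asks the parser not to load the external DTD.
    static Document parseWithoutExternalDtd(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed or feature unsupported", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithoutExternalDtd(XML);
        System.out.println(doc.getDocumentElement().getAttribute("id"));
    }
}
```

With the external DTD skipped, default attributes and entities declared only
in that DTD are not applied, which is the trade-off described above.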

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org