Posted to c-dev@xerces.apache.org by Dee Jay Randall <ra...@circularreasoning.com> on 2001/10/04 01:54:27 UTC

SAX performance observations

  Two observations on performance when implementing a class derived
from DefaultHandler:


1. If you use a large conditional to handle different tags, do the
comparisons in descending order of likelihood. Given this element
distribution: Tag1 (10%); Tag2 (70%); Tag3 (20%), compare Tag2 first,
then Tag3, then Tag1:

void startElement(...)
{
  if ( DOMString("Tag2").equals(localname) )
    ; // do something
  else if ( DOMString("Tag3").equals(localname) )
    ; // do something
  else if ( DOMString("Tag1").equals(localname) )
    ; // do something
}

  In my application (with 18 tags), reordering alone gave roughly a 3x
speedup.

  Also, if every tag began with a distinct first character, dispatching
on that character would save full string comparisons, but I don't
seriously recommend relying on it.




2. Constructing a temporary DOMString("MyTag") on every call appears
to be very expensive. Basically, don't ever do this:

void startElement(...)
{
  if ( DOMString("MyTag").equals(localname) )
    ; // do something
}


Instead, do this:

void startElement(...)
{
  static const DOMString cDOMStringMyTag("MyTag");

  if ( cDOMStringMyTag.equals(localname) )
    ; // do something
}

  In my application this gave roughly a 7x speedup on top of the
previous 3x increase: reading 100K elements went from about 240
seconds to about 11 seconds.



  Can anyone offer any comments on further accelerating my parsing?


  Thanks,
  Dee Jay

+-----------------------------+------------------+-----------------------+
| Founding Partner            | Software Engineer| Dee Jay Randall, B.Sc.|
| Circular Reasoning          | Accrue Software  | M.Sc. Student, CS     |
| randal@circularreasoning.com| www.accrue.com   | ICQ # 43551676        |
+-----------------------------+------------------+-----------------------+
What is the average rank of every song ever written? 42  -- www.launch.com

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org


Re: SAX performance observations

Posted by Dee Jay Randall <ra...@circularreasoning.com>.
  Thank you Mark (and Dean) for the comments.

  I used a combination of gprof, commenting out my "do somethings"
as you suggest, and making my handlers do nothing at all, to
determine where the speed problems were. They were primarily in the
string-recognition part of the startElement handler. My "do
somethings" do of course consume time; they build some STL maps.
Ensuring that one is optimizing the right part is very good advice
for future readers of this thread.

  It is now fast enough for me. I will save using a perfect hash
or a DFA for a later project. :)

  Thanks,
  Dee Jay



On Thu, Oct 04 2001 at 12:59:31P +0100, Mark Weaver wrote:
> [...]
>
> However - before you go overboard with this, are you sure that your parsing
> speed problems still lie in the startElement handler/string recognition?
> One easy way to tell: remove all "do somethings" and time.  Remove all code
> and time.  Compare the two.  If there is only a small percentage difference
> then you are optimising the wrong code!
>
> Mark




Re: SAX performance observations

Posted by Dean Roddey <dr...@charmedquark.com>.
Better yet, get a profiler such as True Time or VTune. You'll likely
find the low-hanging fruit quickly and get the most bang for the time
spent. We used to run VTune on Xerces fairly often to find performance
issues, and if I remember correctly it improved massively in
performance between the 1.1 and 1.2 code bases. I use VTune at work
and at home, and it's great.

--------------------------
Dean Roddey
The Charmed Quark Controller
Charmed Quark Software
droddey@charmedquark.com
http://www.charmedquark.com

"If it don't have a control port, don't buy it!"


----- Original Message -----
From: "Mark Weaver" <ma...@npsl.co.uk>
To: <xe...@xml.apache.org>
Sent: Wednesday, October 03, 2001 4:59 PM
Subject: RE: SAX performance observations


> [...]





RE: SAX performance observations

Posted by Mark Weaver <ma...@npsl.co.uk>.
For string matching, a hash table gets you close to optimal dispatch:
you only have to do full comparisons in the case of collisions.  Since
your tags are fixed, careful choice of a hash function and hash table
size will mean that they at least don't coincide.  If you can
guarantee that you only ever see these tags (e.g. you have validated
the document against a DTD), then you can construct a perfect hash
function, which removes all collisions - GNU gperf is good for this.

If you want to go one better then you can construct a DFA to recognise
the strings; DFAs can be generated by lex/flex (which you will have to
hack on quite a lot to support Unicode characters), or if you are
feeling brave :-) you can construct them by hand.  Basically a DFA
cuts out comparisons and reduces everything to table lookups (speed
gained at the expense of code bloat).  I got a good speed increase of
~3X in a text parser by recoding it as a DFA.

However - before you go overboard with this, are you sure that your parsing
speed problems still lie in the startElement handler/string recognition?
One easy way to tell: remove all "do somethings" and time.  Remove all code
and time.  Compare the two.  If there is only a small percentage difference
then you are optimising the wrong code!

Mark


