Posted to j-dev@xerces.apache.org by Rahul Srivastava <Ra...@Sun.COM> on 2002/05/03 23:03:00 UTC

[xerces2] Measuring performance and optimization

Hi folks,

There has been talk for a long time about improving the performance of Xerces2.
Some benchmarking has been done earlier, for instance the one done by Dennis
Sosnoski, see: http://www.sosnoski.com/opensrc/xmlbench/index.html . Those
results are important for knowing how fast or slow Xerces is compared to other
parsers, but we also need to identify areas of improvement within Xerces itself.
We need to measure the time taken by each individual component in the pipeline,
figure out how much time each component consumes for various events, and then
concentrate on improving performance in those areas. So, here is what we plan
to do (a rough timing sketch for the SAX and DOM measurements follows the list):

+ sax parsing
  - time taken
+ dom parsing
  - dom construction time
  - dom traversal time
  - memory consumed
  - all of the above with the deferred-dom feature set to true and to false
+ DTD validation
  - single parse, time taken
  - repeated parses using the same instance, time taken from the second parse onwards
+ Schema validation
  - single parse, time taken
  - repeated parses using the same instance, time taken from the second parse onwards
+ optimising the pipeline
  - calculate pipeline/component initialization time
  - calculate the time each component in the pipeline takes to propagate
    an event
  - use configurations to set up an optimised pipeline for various cases
    such as no validation, DTD validation only, etc., and calculate the
    time taken
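
For the SAX and DOM measurements, here is a rough sketch of the kind of timing
harness I have in mind. Only the Xerces parser classes and the deferred-dom
feature URI are real; the class name, run count, and output format are made up,
and the memory figure in particular is only a crude estimate:

  import org.apache.xerces.parsers.DOMParser;
  import org.apache.xerces.parsers.SAXParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;
  import org.xml.sax.helpers.DefaultHandler;

  public class ParseTimer {

      // real Xerces feature URI for the deferred DOM
      static final String DEFER =
          "http://apache.org/xml/features/dom/defer-node-expansion";

      public static void main(String[] args) throws Exception {
          String uri = args[0];

          // SAX parsing: reuse one parser instance, so the first run shows
          // the cold-start cost and later runs the steady-state time.
          SAXParser sax = new SAXParser();
          sax.setContentHandler(new DefaultHandler());
          for (int i = 0; i < 5; i++) {
              long t0 = System.currentTimeMillis();
              sax.parse(uri);
              System.out.println("SAX parse " + i + ": "
                  + (System.currentTimeMillis() - t0) + " ms");
          }

          // DOM parsing: time construction and traversal separately, with
          // deferred node expansion both on and off.
          boolean[] defer = { true, false };
          for (int d = 0; d < defer.length; d++) {
              DOMParser dom = new DOMParser();
              dom.setFeature(DEFER, defer[d]);

              long before = usedMemory();
              long t0 = System.currentTimeMillis();
              dom.parse(uri);
              long build = System.currentTimeMillis() - t0;
              // memory right after construction; with defer=true most
              // nodes are not yet expanded at this point
              long after = usedMemory();

              Document doc = dom.getDocument();
              t0 = System.currentTimeMillis();
              int nodes = traverse(doc);
              long walk = System.currentTimeMillis() - t0;

              System.out.println("DOM defer=" + defer[d] + ": build " + build
                  + " ms, traverse " + walk + " ms, " + nodes + " nodes, ~"
                  + ((after - before) / 1024) + " KB");
          }
      }

      // Visit every node; with the deferred DOM this is what forces the
      // deferred nodes to be expanded.
      static int traverse(Node n) {
          int count = 1;
          for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling()) {
              count += traverse(c);
          }
          return count;
      }

      // Crude: totalMemory-freeMemory after an explicit gc. A real harness
      // would need a sturdier way to measure memory consumption.
      static long usedMemory() {
          System.gc();
          Runtime rt = Runtime.getRuntime();
          return rt.totalMemory() - rt.freeMemory();
      }
  }

The same loop with the validation features turned on should cover the DTD and
schema cases as well.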

Apart from this, should we also exercise the existing grammar caching framework
when evaluating the performance of the parser?
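
If we do, a first cut could be as simple as parsing the same schema-validated
document twice through a parser built on the caching configuration and comparing
the two times. XMLGrammarCachingConfiguration and the feature URIs below are
from the existing code base; the rest of the sketch is just illustrative:

  import org.apache.xerces.parsers.SAXParser;
  import org.apache.xerces.parsers.XMLGrammarCachingConfiguration;
  import org.xml.sax.helpers.DefaultHandler;

  public class GrammarCacheTimer {
      public static void main(String[] args) throws Exception {
          // The caching configuration keeps parsed grammars in a pool, so
          // the second parse should not re-build the schema grammar.
          SAXParser parser =
              new SAXParser(new XMLGrammarCachingConfiguration());
          parser.setFeature("http://xml.org/sax/features/validation", true);
          parser.setFeature(
              "http://apache.org/xml/features/validation/schema", true);
          parser.setContentHandler(new DefaultHandler());

          for (int i = 0; i < 2; i++) {
              long t0 = System.currentTimeMillis();
              parser.parse(args[0]);
              System.out.println("parse " + i + ": "
                  + (System.currentTimeMillis() - t0) + " ms");
          }
      }
  }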

We have classified the inputs to be used for this testing as follows (a small
generator sketch for these document shapes follows the two tables):

+ instance docs used
  - tag centric (many tags with small content, say 10-50 bytes per element)
      Type      Tags#
    -------------------
    * small     5-50   
    * medium    50-500
    * large     >500  
    
  - content centric (few tags, say 5-10, with large content)
      Type      content between a pair of tags
    ------------------------------------------
    * small     <500 KB
    * medium    500-5000 KB
    * large     >5000 KB
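
Rather than hand-picking documents, we could generate them, so that everyone
runs against identical inputs. Here is a throwaway generator along these lines;
the class name, file names, and figures are only illustrative:

  import java.io.FileWriter;
  import java.io.PrintWriter;

  public class TestDocGenerator {

      // tags = number of <item> elements, contentBytes = text per element
      static void generate(String file, int tags, int contentBytes)
              throws Exception {
          // Build one reusable chunk of character content.
          StringBuffer text = new StringBuffer(contentBytes);
          for (int i = 0; i < contentBytes; i++) {
              text.append((char) ('a' + i % 26));
          }
          PrintWriter out = new PrintWriter(new FileWriter(file));
          out.println("<?xml version=\"1.0\"?>");
          out.println("<root>");
          for (int i = 0; i < tags; i++) {
              out.println("  <item>" + text + "</item>");
          }
          out.println("</root>");
          out.close();
      }

      public static void main(String[] args) throws Exception {
          // tag centric, medium: 50-500 tags, 10-50 bytes each
          generate("tag-medium.xml", 200, 30);
          // content centric, small: 5-10 tags, <500 KB each
          generate("content-small.xml", 7, 400 * 1024);
      }
  }

A depth parameter could be added to the same generator to cover the
nesting-depth criterion below.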

We can also use the nesting depth of the tags as a criterion for the above cases.

Frankly, there are an enormous number of possible combinations, and many
different figures could go into the above tables to reflect real-world instance
docs. I would like to know the view of the community here. Is this data enough
to evaluate the performance of the parser? Is there any data which is publicly
available and can be used for performance evaluation?

+ DTDs used
  - should use different types of entities

+ XML Schemas used
  - should use most of the element declarations and datatypes

Will this classification really help in any way?

Any comments or suggestions appreciated.

Thanks,
Rahul.



Re: [xerces2] Measuring performance and optimization

Posted by "Theodore W. Leung" <tw...@sauria.com>.
Rahul,

Thanks for reviving this topic.

Tuning Xerces is going to be an iterative process.  We need some test
data that everyone can use, and we need a test driver that everyone can
use.

I'm fine with the metrics and the characterization of test data that you are
proposing in your message.  I think it's a great start.

I'd also like to propose that all the people working on this check the
test data and the test classes into the build, so that anyone can run
the performance timings for themselves.  (I'd like to see this for the
full test suite as well, but that's another message).  

I have some time that I can contribute towards this effort.  

Ted
