Posted to dev@cocoon.apache.org by Jimmy Zhang <cr...@comcast.net> on 2006/02/20 02:40:49 UTC

[ANN] VTD-XML Version 1.5 Released

Eight years after the invention of XML, DOM and SAX, 
despite their respective issues, are still the mainstays 
of application developers.  
 
So is it the end of the road for XML parsing innovation?
 
The VTD-XML project team think not. We are proud to
announce the availability of version 1.5 of VTD-XML, in both
C and Java. VTD-XML is the next-generation open-source XML
parser that goes beyond DOM and SAX in terms of
performance, memory usage and ease of use.
 
The technical highlights of VTD-XML are: 

* Performance: the world's fastest XML parser,
  5x to 10x faster than DOM
* Memory usage: 3x to 5x less than DOM, or 1.3x to 1.5x
  the XML document size
* Random access with built-in XPath support
* A simple and intuitive API 

Other advanced features include:
* Buffer reuse
* Large document support (2GByte)
* Incremental update
* Hardware acceleration
* Native XML indexing.

For demos, latest benchmarks, related articles and software 
downloads, please visit http://vtd-xml.sf.net. Also let us 
know your thoughts and suggestions and help us improve 
VTD-XML.

Re: [ANN] VTD-XML Version 1.5 Released

Posted by Jimmy Zhang <cr...@comcast.net>.

Random access is as defined by DOM: navigating from an element
to one of its child elements, or to one of its attributes...
Interesting that you sound quite negative... if I were you, I would
reserve judgement and try to understand VTD-XML as well
as I can before claiming that it cuts corners...
Cheers,
Jz
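
For readers unfamiliar with the term, here is a minimal sketch of random
access in the DOM sense: moving from an element to one of its attributes or
to one of its child elements. It uses the standard JAXP DOM API, not
VTD-XML's own API; the class name and the sample document are made up for
illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomRandomAccess {
    // Hypothetical sample document, used only for illustration.
    static final String XML =
        "<order id=\"42\"><item>pen</item><item>ink</item></order>";

    // Builds the whole tree in memory, which is what makes random access possible.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Element -> attribute.
    static String rootAttribute(String xml, String name) {
        return parse(xml).getDocumentElement().getAttribute(name);
    }

    // Element -> child element at the given position (0-based); null if absent.
    static String childText(String xml, int index) {
        Element root = parse(xml).getDocumentElement();
        int seen = 0;
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && seen++ == index) {
                return n.getTextContent();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(rootAttribute(XML, "id"));
        System.out.println(childText(XML, 1));
    }
}
```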

----- Original Message ----- 
From: "Stefano Mazzocchi" <st...@apache.org>
To: <de...@cocoon.apache.org>
Sent: Monday, February 20, 2006 11:21 AM
Subject: Re: [ANN] VTD-XML Version 1.5 Released


> Jimmy Zhang wrote:
>> Hi, Thanks for the email.
>> My answers to your questions:
>> 1. It is a tradeoff: VTD-XML consumes more memory, but
>> is easy to use and more powerful. Any random-access-capable XML 
>> processing API *needs* to at least load the entire hierarchical structure 
>> in memory. My take is that, compared with SAX, StAX, DOM
>> and JDOM, VTD-XML is the least likely one to choke, and the best one
>> to handle peak loads...
>
> whatever
>
> most XSLT cases *DO NOT* need to load the xml in memory to be able to 
> process it. Unless you abuse xsl:sort or xpaths with .., most things can 
> be done with pure event-driven pipeline style, and only a small buffer 
> needs to be kept in memory.
>
> Xalan XSLTC is able to pre-process xslt stylesheets and compile them into 
> code that will know how much buffer to keep because it knows what kind of 
> xpath events will be called on the incoming stream.
>
>> 2. I agree with you: benchmarking against a dummy SAX parser is unfair
>> to VTD-XML; a real-life scenario would make VTD-XML look even better.
>
> whatever #2, playing smartass (and avoiding the issue that I mentioned) is 
> unlikely to make your points more solid.
>
>> 3. Look at all the vertical-industry XML vocabularies, SOAP,
>> REST, XML Schema and the infoset data model: DTD seems somewhat
>> deprecated, and VTD-XML doesn't support external entities... other than
>> that, VTD-XML is equally capable.
>
> I agree that DTDs should be deprecated and seem like an SGML vestigial 
> feature.
>
> My point is that it's unfair to compete with a fully compliant xml parser 
> with a parser that knows how to cut corners (and therefore doesn't have to 
> scan the text for entities to expand!).
>
> if xerces was allowed to get away with no need to parse entities and 
> didn't have to create strings, it would be just as fast as yours.
>
> BTW, you have not answered these questions:
>
>>> You claim xpath random access, but what is the algorithmic complexity 
>>> of that? O(1), O(log(n)), O(n), O(n*log(n))? If one were to store the 
>>> parsed tree index on disk, how many pages would one need to page in 
>>> before reaching the required xpath?
>
> -- 
> Stefano.
>
>

Re: [ANN] VTD-XML Version 1.5 Released

Posted by Stefano Mazzocchi <st...@apache.org>.

Jimmy Zhang wrote:
> Hi, Thanks for the email.
> My answers to your questions:
> 1. It is a tradeoff: VTD-XML consumes more memory, but
> is easy to use and more powerful. Any random-access-capable XML 
> processing API *needs* to at least load the entire hierarchical structure 
> in memory. My take is that, compared with SAX, StAX, DOM
> and JDOM, VTD-XML is the least likely one to choke, and the best one
> to handle peak loads...

whatever

most XSLT cases *DO NOT* need to load the xml in memory to be able to 
process it. Unless you abuse xsl:sort or xpaths with .., most things can 
be done with pure event-driven pipeline style, and only a small buffer 
needs to be kept in memory.

Xalan XSLTC is able to pre-process xslt stylesheets and compile them 
into code that will know how much buffer to keep because it knows what 
kind of xpath events will be called on the incoming stream.
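
To make the event-driven point concrete, here is a minimal sketch using the
standard JAXP SAX API. It counts elements with a given name while keeping
only a counter in memory, however long the input is. The class name, method
name and sample input are hypothetical; this is not Cocoon or XSLTC code,
just an illustration of pipeline-style processing.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class StreamingCount {
    // Counts start tags whose qualified name equals `name`.
    // No tree is built: each event is handled and then forgotten.
    static int countElements(String xml, final String name) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // Hypothetical input, used only for illustration.
        System.out.println(countElements("<a><b/><b/><c/></a>", "b"));
    }
}
```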

> 2. I agree with you: benchmarking against a dummy SAX parser is unfair
> to VTD-XML; a real-life scenario would make VTD-XML look even better.

whatever #2, playing smartass (and avoiding the issue that I mentioned) 
is unlikely to make your points more solid.

> 3. Look at all the vertical-industry XML vocabularies, SOAP,
> REST, XML Schema and the infoset data model: DTD seems somewhat
> deprecated, and VTD-XML doesn't support external entities... other than
> that, VTD-XML is equally capable.

I agree that DTDs should be deprecated and seem like an SGML vestigial 
feature.

My point is that it's unfair to compete with a fully compliant xml 
parser with a parser that knows how to cut corners (and therefore 
doesn't have to scan the text for entities to expand!).

if xerces was allowed to get away with no need to parse entities and 
didn't have to create strings, it would be just as fast as yours.

BTW, you have not answered these questions:

>> You claim xpath random access, but what is the algorithmic 
>> complexity of that? O(1), O(log(n)), O(n), O(n*log(n))? If one were to 
>> store the parsed tree index on disk, how many pages would one need to 
>> page in before reaching the required xpath?

-- 
Stefano.

Re: [ANN] VTD-XML Version 1.5 Released

Posted by Jimmy Zhang <cr...@comcast.net>.

Hi, Thanks for the email.
My answers to your questions:
1. It is a tradeoff: VTD-XML consumes more memory, but
is easy to use and more powerful. Any random-access-capable 
XML processing API *needs* to at least load the entire hierarchical 
structure in memory. My take is that, compared with SAX, StAX, DOM
and JDOM, VTD-XML is the least likely one to choke, and the best one
to handle peak loads...
2. I agree with you: benchmarking against a dummy SAX parser is unfair
to VTD-XML; a real-life scenario would make VTD-XML look even better.
3. Look at all the vertical-industry XML vocabularies, SOAP,
REST, XML Schema and the infoset data model: DTD seems somewhat
deprecated, and VTD-XML doesn't support external entities... other than
that, VTD-XML is equally capable.

Cheers,
jz



----- Original Message ----- 
From: "Stefano Mazzocchi" <st...@apache.org>
To: <de...@cocoon.apache.org>
Sent: Sunday, February 19, 2006 8:57 PM
Subject: Re: [ANN] VTD-XML Version 1.5 Released





> 
> Hmmmm, I have to admit that I've toyed with this idea myself lately, 
> especially since I'm diving deep into processing large quantities of XML 
> files these days (when I say 'large', I mean it: so large that 32 bits of 
> address space are not enough).
> 
> The idea of non-extracting parsing is nice but there are a few issues:
> 
>  1) the memory requirements, while still much lower than DOM's, are still 
> *way* higher than those of an event-driven model like SAX. Cocoon, for 
> example, would die if we were to move to a parser like this one, 
> especially under load spikes.
> 
>  2) benchmarking against a dummy SAX content handler is completely 
> meaningless. In order for the API to be of any use, you have to create 
> strings; you can't simply pass pointers to char arrays around. I bet 
> that if the SAX parser could go on without creating strings, it would be 
> just as fast (xerces, in fact, does use a similar mechanism to return 
> you the characters() SAX event, where the entire document is kept in 
> memory and the start/finish pointers are passed instead of a new array).
> 
>  3) 90% of the slowness comes from 10% of the details in the XML spec, 
> which means that in order to stay fast, you need to sacrifice compliance... 
> which is not an option these days given how cheap silicon is.
> 
> But don't get me wrong, I think there is something interesting in what 
> you are doing: I think it would be cool if you could serialize the 'tree 
> index' alongside the document on disk and provide some sort of b-tree 
> indexing for it. It would help me in my multi-GB-of-XML day2day struggle.
> 
> You claim xpath random access, but what is the algorithmic complexity 
> of that? O(1), O(log(n)), O(n), O(n*log(n))? If one were to store the 
> parsed tree index on disk, how many pages would one need to page in 
> before reaching the required xpath?
> 
> -- 
> Stefano.
> 
>

Re: [ANN] VTD-XML Version 1.5 Released

Posted by Stefano Mazzocchi <st...@apache.org>.

Jimmy Zhang wrote:
> Eight years after the invention of XML, DOM and SAX,
> despite their respective issues, are still the mainstays
> of application developers. 
>  
> So is it the end of the road for XML parsing innovation?
>  
> The VTD-XML project team think not. We are proud to
> announce the availability of version 1.5 of VTD-XML, in both
> C and Java. VTD-XML is the next-generation open-source XML
> parser that goes beyond DOM and SAX in terms of
> performance, memory usage and ease of use.
>  
> The technical highlights of VTD-XML are:
>  
> * Performance: the world's fastest XML parser,
>   5x to 10x faster than DOM
> * Memory usage: 3x to 5x less than DOM, or 1.3x to 1.5x
>   the XML document size
> * Random access with built-in XPath support
> * A simple and intuitive API
>  
> Other advanced features include:
> * Buffer reuse
> * Large document support (2GByte)
> * Incremental update
> * Hardware acceleration
> * Native XML indexing.
>  
> For demos, latest benchmarks, related articles and software
> downloads, please visit http://vtd-xml.sf.net. Also let us
> know your thoughts and suggestions and help us improve
> VTD-XML.

Hmmmm, I have to admit that I've toyed with this idea myself lately, 
especially since I'm diving deep into processing large quantities of XML 
files these days (when I say 'large', I mean it: so large that 32 bits of 
address space are not enough).

The idea of non-extracting parsing is nice but there are a few issues:

  1) the memory requirements, while still much lower than DOM's, are still 
*way* higher than those of an event-driven model like SAX. Cocoon, for 
example, would die if we were to move to a parser like this one, 
especially under load spikes.

  2) benchmarking against a dummy SAX content handler is completely 
meaningless. In order for the API to be of any use, you have to create 
strings; you can't simply pass pointers to char arrays around. I bet 
that if the SAX parser could go on without creating strings, it would be 
just as fast (xerces, in fact, does use a similar mechanism to return 
you the characters() SAX event, where the entire document is kept in 
memory and the start/finish pointers are passed instead of a new array).

  3) 90% of the slowness comes from 10% of the details in the XML spec, 
which means that in order to stay fast, you need to sacrifice compliance... 
which is not an option these days given how cheap silicon is.
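
As an illustration of point 2, here is a minimal sketch of the SAX
characters(char[] ch, int start, int length) callback from the standard
org.xml.sax API. The first method only reads the window the parser hands
over and creates no strings; the second copies the text out, which is the
cost a "dummy" content handler never pays. The class and method names are
hypothetical, and nothing here measures xerces or VTD-XML.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class CharactersEvent {
    private static void run(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // "Dummy" style: uses only the start/length window, never builds a String.
    static int textLength(String xml) {
        final int[] total = {0};
        run(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                total[0] += length;
            }
        });
        return total[0];
    }

    // "Useful API" style: the text is copied out of the parser's buffer.
    static String text(String xml) {
        final StringBuilder sb = new StringBuilder();
        run(xml, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length); // the copy happens here
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input, used only for illustration.
        String xml = "<a>hello<b>world</b></a>";
        System.out.println(textLength(xml));
        System.out.println(text(xml));
    }
}
```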

But don't get me wrong, I think there is something interesting in what 
you are doing: I think it would be cool if you could serialize the 'tree 
index' alongside the document on disk and provide some sort of b-tree 
indexing for it. It would help me in my multi-GB-of-XML day2day struggle.

You claim xpath random access, but what is the algorithmic complexity 
of that? O(1), O(log(n)), O(n), O(n*log(n))? If one were to store the 
parsed tree index on disk, how many pages would one need to page in 
before reaching the required xpath?
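
One way to reason about that question is with a toy model of a
non-extracting index: each element name is recorded as an offset, a length
and a nesting depth packed into a long, and names are compared in place
against the original buffer. This is only a sketch under assumptions. The
bit layout, class name and method names are invented for illustration and
are not VTD-XML's actual record format or lookup algorithm, and the scanner
ignores comments, CDATA, processing instructions, entities and '>' inside
attribute values. In this toy model a lookup by name is a linear scan, so
O(n) in the number of tokens, and a disk-resident copy of the array would
be read sequentially up to the match.

```java
import java.util.Arrays;

final class TokenIndexSketch {
    // Hypothetical layout: offset in the upper 32 bits, then 16 bits of
    // length and 16 bits of depth.
    static long pack(int offset, int length, int depth) {
        return ((long) offset << 32)
             | ((long) (length & 0xFFFF) << 16)
             | (depth & 0xFFFF);
    }
    static int offset(long t) { return (int) (t >>> 32); }
    static int length(long t) { return (int) ((t >>> 16) & 0xFFFF); }
    static int depth(long t)  { return (int) (t & 0xFFFF); }

    // Naive scan that records one token per start tag (element names only).
    static long[] index(char[] doc) {
        long[] tokens = new long[doc.length / 2 + 1];
        int n = 0, depth = 0;
        for (int i = 0; i < doc.length; i++) {
            if (doc[i] != '<') continue;
            if (i + 1 < doc.length && doc[i + 1] == '/') { depth--; continue; }
            int start = i + 1, end = start;
            while (end < doc.length && doc[end] != '>'
                    && doc[end] != '/' && doc[end] != ' ') end++;
            tokens[n++] = pack(start, end - start, depth);
            int close = end;
            while (close < doc.length && doc[close] != '>') close++;
            boolean selfClosing = close > 0 && doc[close - 1] == '/';
            if (!selfClosing) depth++;
            i = close;
        }
        return Arrays.copyOf(tokens, n);
    }

    // Linear lookup: compares names in place, without creating strings.
    static int find(char[] doc, long[] tokens, String name) {
        for (int t = 0; t < tokens.length; t++) {
            int off = offset(tokens[t]), len = length(tokens[t]);
            if (len != name.length()) continue;
            int k = 0;
            while (k < len && doc[off + k] == name.charAt(k)) k++;
            if (k == len) return t;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical input, used only for illustration.
        char[] doc = "<a><b/><c><d/></c></a>".toCharArray();
        long[] tokens = index(doc);
        int hit = find(doc, tokens, "d");
        System.out.println(hit + " at depth " + depth(tokens[hit]));
    }
}
```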

-- 
Stefano.