Posted to dev@abdera.apache.org by Dan Diephouse <da...@mulesource.com> on 2007/10/09 18:56:49 UTC

Understanding Incremental Parsing [was Re: failing parser test]

I was wondering if someone could answer a quick question on the 
incremental parsing business, just so I can be sure I fully get things. 
As I understand it, most parts of the Abdera model (at least the impl) are 
built on an Axiom OMElementImpl. As far as incremental parsing is 
concerned, the thing this is buying Abdera is that Axiom can 
discard nodes later on, right? i.e. I can read entry 1, then move to entry 
2, and entry 1 will leave memory? If so, how is that turned on?

- Dan


James M Snell wrote:
> Forcing a clone is the wrong thing to do, but we could introduce a
> method that would force the parse to complete without creating a bunch
> of duplicate objects. FWIW, that could be done today by calling
> toString() rather than clone.
>
> - James
>
> Ugo Cei wrote:
>   
>> On Oct 8, 2007, at 9:10 PM, Dan Diephouse wrote:
>>
>>     
>>> I think this test should be disabled for now. I don't think it's good
>>> policy to just leave a failing test in the build. The build should
>>> *always* build and *always* run the tests IMO.  The issue can just be
>>> marked as a blocker for the release and revisited when time/priorities
>>> permit. As a user and developer it's very frustrating to find a build
>>> that doesn't work (like the Maven build in Abdera currently).
>>>       
>> I am always fighting with myself over issues like this one, but in this
>> case I think you are right, so I've put the workaround in place to make
>> the test succeed.
>>
>> I also agree with Garrett that this should be considered a bug: it's
>> just too easy for users to fall into it and bang their heads against a
>> wall for a few hours before they realize this is the way the code is
>> actually supposed to work and implement the workaround in their own
>> code.
>>
>> OTOH, I don't know how easy this would be to fix: maybe by keeping track
>> of partially-parsed documents and calling clone() internally when a
>> modification attempt is detected? Sounds messy.
>>
>>     Ugo
>>
>>
>>     


-- 
Dan Diephouse
MuleSource
http://mulesource.com | http://netzooid.com/blog


Re: Understanding Incremental Parsing [was Re: failing parser test]

Posted by James M Snell <ja...@gmail.com>.

Dan Diephouse wrote:
> James M Snell wrote:
>> The incremental parser model ensures that only the objects we actually
>> need will be loaded into memory.  A better way to put it would be
>> parse-on-demand.  Think of it as a hybrid between the SAX and DOM
>> approaches.  The main advantage of this approach is that it uses
>> significantly less memory than DOM.
> For times when you're reading only the first part of the document, I can
> see how this would result in less memory and quicker access times. But
> for someone who needs to access most of the document - i.e. scan through
> the entries in the feed - the whole document will still need to be
> scanned/parsed, so that shouldn't result in any difference in
> memory/time over the normal DOM approach. That is, an OMElementImpl
> will still be created at some point for each and every element, and
> each OMElement will still have attributes, child elements, etc.
> associated with it.
> 
> For instance -
> http://www.ibm.com/developerworks/webservices/library/ws-java2/. I think
> the Axiom numbers have probably improved to JDOM/DOM4j levels since
> then, but it still shows that, given equivalent documents which are
> eventually read/loaded into memory, Axiom will have the same
> order-of-magnitude memory characteristics as anything else out there.
> 

True, but because of the way Axiom is implemented, we still realize a
significant memory and speed improvement even when working with the full
document.  I'd encourage you to run some of the numbers yourself.


> Or am I missing something here? Abdera doesn't just skip over elements
> which aren't accessed sequentially, does it? Or are you saying that the
> benefit only comes when you don't need to access the whole document? i.e.
> just read the feed metadata and not the entries?

Abdera only consumes the stream when it's absolutely necessary to do so.
Elements are not skipped over unless there is a ParseFilter in place
telling it to do so.
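
For reference, here is a minimal sketch of what wiring up a ParseFilter
looks like. This is just an illustration: it assumes the
WhiteListParseFilter implementation and the ParserOptions.setParseFilter
hook, and the exact class names and signatures may differ between Abdera
versions.

import java.io.InputStream;

import org.apache.abdera.Abdera;
import org.apache.abdera.model.Document;
import org.apache.abdera.model.Feed;
import org.apache.abdera.parser.Parser;
import org.apache.abdera.parser.ParserOptions;
import org.apache.abdera.parser.filter.WhiteListParseFilter;
import org.apache.abdera.util.Constants;

public class ParseFilterSketch {
  public static Document<Feed> parseTitlesAndLinks(InputStream in) {
    Abdera abdera = new Abdera();
    Parser parser = abdera.getParser();

    // Only whitelisted elements are ever turned into objects; everything
    // else is dropped while the stream is being parsed.
    WhiteListParseFilter filter = new WhiteListParseFilter();
    filter.add(Constants.FEED);
    filter.add(Constants.ENTRY);
    filter.add(Constants.TITLE);
    filter.add(Constants.LINK);

    ParserOptions options = parser.getDefaultParserOptions();
    options.setParseFilter(filter);
    return parser.parse(in, options);
  }
}

With that whitelist in place, only feed, entry, title, and link elements
ever become objects, which is essentially the setup behind the ROME
comparison quoted below.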

If I have a Feed with 100 entries, and all I do is feed.getTitle(), the
100 entries will never be parsed.  Because Atom requires that the
entries come after the rest of the feed-level elements, I can read all
of the feed metadata without ever having to parse the individual elements.

When I call feed.getEntries(), Abdera returns a special List
implementation that uses an internal iterator.  That iterator will
incrementally parse the stream, so if I do for (Entry entry :
feed.getEntries()), each loop will incrementally parse the stream;
however, if I do for (int n = 0; n < feed.getEntries().size(); n++), the
call to size() will result in the entire stream being consumed in order
to respond with the correct number of entries.
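
To make the two access patterns concrete, here is a short sketch. Assume
"in" is an InputStream positioned at the start of an Atom feed; the calls
around the loops are the ones described above, and the class and method
names wrapping them are just illustrative scaffolding.

import java.io.InputStream;
import java.util.List;

import org.apache.abdera.Abdera;
import org.apache.abdera.model.Document;
import org.apache.abdera.model.Entry;
import org.apache.abdera.model.Feed;

public class IncrementalSketch {

  // Incremental: each pass of the loop parses just enough of the
  // stream to produce the next Entry.
  public static void lazy(InputStream in) {
    Abdera abdera = new Abdera();
    Document<Feed> doc = abdera.getParser().parse(in);
    Feed feed = doc.getRoot();
    System.out.println(feed.getTitle()); // no entries parsed yet
    for (Entry entry : feed.getEntries()) {
      System.out.println(entry.getTitle());
    }
  }

  // Non-incremental: size() can only be answered by counting every
  // entry, so that single call consumes the rest of the stream before
  // the loop body runs at all.
  public static void eager(InputStream in) {
    Abdera abdera = new Abdera();
    Document<Feed> doc = abdera.getParser().parse(in);
    Feed feed = doc.getRoot();
    List<Entry> entries = feed.getEntries();
    for (int n = 0; n < entries.size(); n++) {
      System.out.println(entries.get(n).getTitle());
    }
  }
}

Both variants end up printing every title; the difference is that the
first parses entries one at a time as the loop advances, while the second
pulls the entire feed into memory on the size() call.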

>> Another advantage is that it means
>> we can introduce filters into the parsing process so that unwanted
>> elements are ignored completely (that's the ParseFilter stuff you see in
>> the core).  To illustrate the difference, a while back we used ROME
>> (which uses JDOM) to parse Tim Bray's Atom feed and output just titles
>> and links to System.out.  We used Abdera with a parse filter to do the
>> exact same test.  The JDOM approach used over 6 MB of memory; the Abdera
>> approach used around 700 KB of memory.  The Abdera approach was
>> significantly faster as well.
>>
>>   
> Were you skipping all the elements except for the titles? If so, a
> fairer comparison would have implemented a StAX/SAX filter for JDOM as
> well. Also, I'm not sure what parser you used for JDOM, but Woodstox is
> 1.5-10x faster than the standard SAX parsers IIRC, so that may have been
> a factor.
> 

The test was based on the interfaces that ROME exposed at the time.
From what I recall, there was no way for us to plug in any kind of
parse filter.  We could have just missed it, however.

- James

> - Dan
> 

Re: Understanding Incremental Parsing [was Re: failing parser test]

Posted by Dan Diephouse <da...@mulesource.com>.
James M Snell wrote:
> The incremental parser model ensures that only the objects we actually
> need will be loaded into memory.  A better way to put it would be
> parse-on-demand.  Think of it as a hybrid between the SAX and DOM
> approaches.  The main advantage of this approach is that it uses
> significantly less memory than DOM.
For times when you're reading only the first part of the document, I can 
see how this would result in less memory and quicker access times. But 
for someone who needs to access most of the document - i.e. scan through 
the entries in the feed - the whole document will still need to be 
scanned/parsed, so that shouldn't result in any difference in 
memory/time over the normal DOM approach. That is, an OMElementImpl 
will still be created at some point for each and every element, and 
each OMElement will still have attributes, child elements, etc. 
associated with it.

For instance - 
http://www.ibm.com/developerworks/webservices/library/ws-java2/. I think 
the Axiom numbers have probably improved to JDOM/DOM4j levels since 
then, but it still shows that, given equivalent documents which are 
eventually read/loaded into memory, Axiom will have the same 
order-of-magnitude memory characteristics as anything else out there.

Or am I missing something here? Abdera doesn't just skip over elements 
which aren't accessed sequentially, does it? Or are you saying that the 
benefit only comes when you don't need to access the whole document? i.e. 
just read the feed metadata and not the entries?
> Another advantage is that it means
> we can introduce filters into the parsing process so that unwanted
> elements are ignored completely (that's the ParseFilter stuff you see in
> the core).  To illustrate the difference, a while back we used ROME
> (which uses JDOM) to parse Tim Bray's Atom feed and output just titles
> and links to System.out.  We used Abdera with a parse filter to do the
> exact same test.  The JDOM approach used over 6 MB of memory; the Abdera
> approach used around 700 KB of memory.  The Abdera approach was
> significantly faster as well.
>
>   
Were you skipping all the elements except for the titles? If so, a 
fairer comparison would have implemented a StAX/SAX filter for JDOM as 
well. Also, I'm not sure what parser you used for JDOM, but Woodstox is 
1.5-10x faster than the standard SAX parsers IIRC, so that may have been 
a factor.

- Dan

-- 
Dan Diephouse
MuleSource
http://mulesource.com | http://netzooid.com/blog


Re: Understanding Incremental Parsing [was Re: failing parser test]

Posted by James M Snell <ja...@gmail.com>.
The incremental parser model ensures that only the objects we actually
need will be loaded into memory.  A better way to put it would be
parse-on-demand.  Think of it as a hybrid between the SAX and DOM
approaches.  The main advantage of this approach is that it uses
significantly less memory than DOM.  Another advantage is that it means
we can introduce filters into the parsing process so that unwanted
elements are ignored completely (that's the ParseFilter stuff you see in
the core).  To illustrate the difference, a while back we used ROME
(which uses JDOM) to parse Tim Bray's Atom feed and output just titles
and links to System.out.  We used Abdera with a parse filter to do the
exact same test.  The JDOM approach used over 6 MB of memory; the Abdera
approach used around 700 KB of memory.  The Abdera approach was
significantly faster as well.

- James

Dan Diephouse wrote:
> I was wondering if someone could answer a quick question on the
> incremental parsing business, just so I can be sure I fully get things.
> As I understand it, most parts of the Abdera model (at least the impl) are
> built on an Axiom OMElementImpl. As far as incremental parsing is
> concerned, the thing this is buying Abdera is that Axiom can
> discard nodes later on, right? i.e. I can read entry 1, then move to entry
> 2, and entry 1 will leave memory? If so, how is that turned on?
> 
> - Dan
> 
> 
> James M Snell wrote:
>> Forcing a clone is the wrong thing to do, but we could introduce a
>> method that would force the parse to complete without creating a bunch
>> of duplicate objects. FWIW, that could be done today by calling
>> toString() rather than clone.
>>
>> - James
>>
>> Ugo Cei wrote:
>>  
>>> On Oct 8, 2007, at 9:10 PM, Dan Diephouse wrote:
>>>
>>>    
>>>> I think this test should be disabled for now. I don't think it's good
>>>> policy to just leave a failing test in the build. The build should
>>>> *always* build and *always* run the tests IMO.  The issue can just be
>>>> marked as a blocker for the release and revisited when time/priorities
>>>> permit. As a user and developer it's very frustrating to find a build
>>>> that doesn't work (like the Maven build in Abdera currently).
>>>>       
>>> I am always fighting with myself over issues like this one, but in this
>>> case I think you are right, so I've put the workaround in place to make
>>> the test succeed.
>>>
>>> I also agree with Garrett that this should be considered a bug: it's
>>> just too easy for users to fall into it and bang their heads against a
>>> wall for a few hours before they realize this is the way the code is
>>> actually supposed to work and implement the workaround in their own
>>> code.
>>>
>>> OTOH, I don't know how easy this would be to fix: maybe by keeping track
>>> of partially-parsed documents and calling clone() internally when a
>>> modification attempt is detected? Sounds messy.
>>>
>>>     Ugo
>>>
>>>
>>>     
> 
>