Posted to dev@ofbiz.apache.org by Adrian Crum <ad...@yahoo.com> on 2009/04/26 05:48:24 UTC

Discussion: XML file parsing improvement

OFBiz uses a lot of XML files. When each XML file is read, it is first parsed into a DOM Document, then the DOM Document is parsed into OFBiz Java objects. This two-step process consumes a lot of memory, and it takes more time than it should.

There is an alternative - what is called event-driven parsing. The XML parser can be set up to convert XML elements directly to the OFBiz Java objects - bypassing the DOM Document build and parse steps. Theoretically, this could provide a huge performance boost, and it would use less memory. In addition, it would solve the problem of huge XML files maxing out server memory during the parse process - like with entity XML import/export.
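As a rough illustration of the idea (not OFBiz code), the sketch below uses the JDK's SAX API to build objects straight from parser events, with no DOM Document in between. ScreenDef, DirectSaxExample, and the screen/name markup are hypothetical names made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for an OFBiz model object; not an actual OFBiz class.
final class ScreenDef {
    final String name;

    ScreenDef(String name) {
        this.name = name;
    }
}

final class DirectSaxExample {
    // Builds ScreenDef objects directly from SAX events.
    // No DOM Document is ever constructed.
    static List<ScreenDef> parseScreens(String xml) {
        final List<ScreenDef> screens = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Called once per opening tag as the parser streams the input.
                if ("screen".equals(qName)) {
                    screens.add(new ScreenDef(attrs.getValue("name")));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return screens;
    }
}
```

Because the parser hands over one element at a time, memory use depends on the objects the handler keeps rather than on the size of the whole document.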

Has anyone else considered this? Do you think it is worth pursuing?

-Adrian



      

Re: Discussion: XML file parsing improvement

Posted by Jacques Le Roux <ja...@les7arts.com>.
Hi Adrian,

I don't have time at the moment, but yes, a page in the Wiki with a link to a Jira issue sounds good.

Thanks

Jacques

From: "Adrian Crum" <ad...@yahoo.com>
>
> Okay, I did some work on this purely as a learning experience for me. I wanted to learn SAX parsing, so I tried converting the 
> screen widgets to SAX parsing.
>
> I found a small public-domain framework that makes the whole process very easy. Since all of the model screen widgets subclass a 
> single base class, I was able to hook them into the parsing framework by just having the base class subclass one of the framework 
> classes. Model widgets that don't have sub-widgets just needed a new constructor. Model widgets that have sub-widgets needed a 
> little extra code to handle the sub-widgets, but it was no more code than what already exists to handle the DOM version of the 
> sub-widgets.
>
> Overall, it was pretty easy and I was surprised when it worked the very first time I tried it.
>
> If anyone is interested, I would be happy to post the POC code in Jira. Just let me know.
>
> -Adrian



Re: Discussion: XML file parsing improvement

Posted by Adrian Crum <ad...@yahoo.com>.
Okay, I did some work on this purely as a learning experience for me. I wanted to learn SAX parsing, so I tried converting the screen widgets to SAX parsing.

I found a small public-domain framework that makes the whole process very easy. Since all of the model screen widgets subclass a single base class, I was able to hook them into the parsing framework by just having the base class subclass one of the framework classes. Model widgets that don't have sub-widgets just needed a new constructor. Model widgets that have sub-widgets needed a little extra code to handle the sub-widgets, but it was no more code than what already exists to handle the DOM version of the sub-widgets.
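For readers who want to picture the sub-widget handling, here is a minimal sketch using plain JDK SAX rather than the framework mentioned above (which is not named here, so its API is not shown). WidgetNode and WidgetTreeHandler are hypothetical names, not OFBiz classes. The assumption is that each widget gets a constructor taking the SAX attributes, and that a stack of open elements is the only extra code needed to attach sub-widgets to their parent.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical widget model; the real OFBiz model widget classes differ.
final class WidgetNode {
    final String tag;
    final String name; // may be null when the element has no name attribute
    final List<WidgetNode> children = new ArrayList<>();

    // The "new constructor": builds the widget from SAX attributes, not a DOM Element.
    WidgetNode(String tag, Attributes attrs) {
        this.tag = tag;
        this.name = attrs.getValue("name");
    }
}

final class WidgetTreeHandler extends DefaultHandler {
    // Widgets whose closing tag has not been seen yet; the top is the current parent.
    private final Deque<WidgetNode> open = new ArrayDeque<>();
    private WidgetNode root;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        WidgetNode node = new WidgetNode(qName, attrs);
        if (open.isEmpty()) {
            root = node;
        } else {
            open.peek().children.add(node); // attach as a sub-widget
        }
        open.push(node);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop(); // this widget is complete
    }

    static WidgetNode parse(String xml) {
        WidgetTreeHandler handler = new WidgetTreeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return handler.root;
    }
}
```

Leaf widgets only need the constructor; the stack in the handler is the part that corresponds to the "little extra code" for widgets with sub-widgets.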

Overall, it was pretty easy and I was surprised when it worked the very first time I tried it.

If anyone is interested, I would be happy to post the POC code in Jira. Just let me know.

-Adrian



--- On Sat, 4/25/09, Adrian Crum <ad...@yahoo.com> wrote:

> From: Adrian Crum <ad...@yahoo.com>
> Subject: Re: Discussion: XML file parsing improvement
> To: dev@ofbiz.apache.org
> Date: Saturday, April 25, 2009, 10:28 PM
> Adam and David,
> 
> Thank you for your comments! I'll look into the entity
> import code some more.
> 
> Personally, I don't have an issue with importing large
> XML files. I see it come up from time to time on the mailing
> lists. I remember BJ Freeman had to write his own import
> code because of some OFBiz limitation.
> 
> I'll accept the widget files, scripts, and config files
> are too small to optimize. Having event-driven parsing for
> those might be an interesting experiment though.
> 
> -Adrian
> 
> 

      

Re: Discussion: XML file parsing improvement

Posted by Adrian Crum <ad...@yahoo.com>.
Adam and David,

Thank you for your comments! I'll look into the entity import code some more.

Personally, I don't have an issue with importing large XML files. I see it come up from time to time on the mailing lists. I remember BJ Freeman had to write his own import code because of some OFBiz limitation.

I'll accept that the widget files, scripts, and config files are too small to be worth optimizing. Having event-driven parsing for those might be an interesting experiment, though.

-Adrian


--- On Sat, 4/25/09, Adam Heath <do...@brainfood.com> wrote:

> From: Adam Heath <do...@brainfood.com>
> Subject: Re: Discussion: XML file parsing improvement
> To: dev@ofbiz.apache.org
> Date: Saturday, April 25, 2009, 9:52 PM
> Adrian Crum wrote:
> > OFBiz uses a lot of XML files. When each XML file is
> > read, it is first parsed into a DOM Document, then the
> > DOM Document is parsed into OFBiz Java objects. This
> > two-step process consumes a lot of memory, and it
> > takes more time than it should.
> > 
> > There is an alternative - what is called event-driven
> > parsing. The XML parser can be set up to convert XML
> > elements directly to the OFBiz Java objects -
> > bypassing the DOM Document build and parse steps.
> > Theoretically, this could provide a huge performance
> > boost, and it would use less memory. In addition, it
> > would solve the problem of huge XML files maxing out
> > server memory during the parse process - like with
> > entity XML import/export.
> > 
> > Has anyone else considered this? Do you think it is
> > worth pursuing?
> 
> What files are you talking about, that are so huge, they can't be
> parsed with the simpler DOM model?
> 
> entity data files are sax based already.
> 
> widget files, scripts, config files are small, so it's better to keep
> the simpler algo, as David suggested.
> 
> additionally, I already did some memory profiling a while back, and
> interned the long-lived strings from parsed xml.  This actually
> reduced memory usage.
> 
> Another thing, the widgets, scripts, config files are read very
> infrequently, then cached.  The time it takes to parse them is not
> really a performance consideration.
> 
> As an aside, how much swap do you have on your server?  Any?  Is it
> being used?  Then you don't have enough ram.  If your work-load is
> causing swap to be used, then you haven't correctly identified your
> work load usage requirements.
> 
> The same can be said for java maximum memory allocation.


      

Re: Discussion: XML file parsing improvement

Posted by Adam Heath <do...@brainfood.com>.
Adrian Crum wrote:
> OFBiz uses a lot of XML files. When each XML file is
> read, it is first parsed into a DOM Document, then the
> DOM Document is parsed into OFBiz Java objects. This
> two-step process consumes a lot of memory, and it
> takes more time than it should.
> 
> There is an alternative - what is called event-driven
> parsing. The XML parser can be set up to convert XML
> elements directly to the OFBiz Java objects -
> bypassing the DOM Document build and parse steps.
> Theoretically, this could provide a huge performance
> boost, and it would use less memory. In addition, it
> would solve the problem of huge XML files maxing out
> server memory during the parse process - like with entity XML
> import/export.
> 
> Has anyone else considered this? Do you think it is
> worth pursuing?

What files are you talking about, that are so huge, they can't be
parsed with the simpler DOM model?

Entity data files are SAX-based already.

Widget files, scripts, and config files are small, so it's better to keep
the simpler algorithm, as David suggested.

Additionally, I already did some memory profiling a while back, and
interned the long-lived strings from parsed XML.  This actually
reduced memory usage.
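
The exact interning code is not shown in this thread, so the following is only a sketch of the general technique, using the JDK's String.intern(). InternExample and the type attribute are hypothetical names. The point is that equal attribute values read from different elements end up sharing one String instance, so long-lived model objects do not each hold their own copy.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class InternExample {
    // Collects the "type" attribute of every element, interning each value
    // so that equal values are stored as the same String instance.
    static List<String> internedTypes(String xml) {
        final List<String> types = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                String type = attrs.getValue("type");
                if (type != null) {
                    types.add(type.intern()); // canonical, pooled instance
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return types;
    }
}
```

Interning pays off only for values that repeat often and live a long time; for one-off strings it just adds work to the pool lookup.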

Another thing: the widgets, scripts, and config files are read very
infrequently, then cached.  The time it takes to parse them is not
really a performance consideration.
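
A small sketch of why parse time matters little once results are cached. It uses a plain ConcurrentHashMap as a stand-in for whatever cache OFBiz actually uses; ParsedModelCache and its methods are hypothetical names. Each location is parsed once, and later lookups return the stored model.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical parse-once cache; M is whatever model type the parser produces.
final class ParsedModelCache<M> {
    private final ConcurrentHashMap<String, M> cache = new ConcurrentHashMap<>();
    private final AtomicInteger parseCount = new AtomicInteger();
    private final Function<String, M> parser;

    ParsedModelCache(Function<String, M> parser) {
        this.parser = parser;
    }

    // Returns the cached model, running the parser only on the first request.
    M get(String location) {
        return cache.computeIfAbsent(location, loc -> {
            parseCount.incrementAndGet();
            return parser.apply(loc);
        });
    }

    // How many times the (expensive) parser actually ran.
    int parseCount() {
        return parseCount.get();
    }
}
```

With this shape, the parser's cost is paid once per file, however many requests read the model afterwards.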

As an aside, how much swap do you have on your server?  Any?  Is it
being used?  Then you don't have enough RAM.  If your workload is
causing swap to be used, then you haven't correctly identified your
workload's usage requirements.

The same can be said for Java maximum memory allocation.


Re: Discussion: XML file parsing improvement

Posted by David E Jones <da...@hotwaxmedia.com>.
I'm guessing you're speaking of SAX parsers when you talk about
event-driven parsing.

If you take a look at the entity XML import code, it actually is a SAX
event-driven parser.

As for other XML readers (entity defs, widget XML files, etc.), I'd be
surprised if a SAX reader resulted in much performance improvement, and
it makes the code more complex. So the first step would be to test it on
one and do some performance tests to see if it is any faster. The XML
reading code already has some simple stuff to test how long it takes,
though to test this you should run it 100 times or so to make the times
more meaningful (otherwise they are probably less than 1ms and possibly
not as accurate on that small time scale).
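
A minimal sketch of that kind of repeated timing, using only JDK classes. It is not the timing code that already exists in OFBiz; ParseTiming and its methods are hypothetical names. It measures raw DOM and SAX parsing only and leaves out the conversion to model objects, so treat any numbers as a starting point rather than a verdict.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class ParseTiming {
    private static final DocumentBuilderFactory DOM_FACTORY = DocumentBuilderFactory.newInstance();
    private static final SAXParserFactory SAX_FACTORY = SAXParserFactory.newInstance();

    interface Parse {
        void run(byte[] xml) throws Exception;
    }

    // Average nanoseconds per parse over `iterations` runs, after an equal warm-up.
    static long averageNanos(Parse parse, byte[] xml, int iterations) {
        try {
            for (int i = 0; i < iterations; i++) {
                parse.run(xml); // warm-up, not timed
            }
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                parse.run(xml);
            }
            return (System.nanoTime() - start) / iterations;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Builds a full DOM Document and discards it.
    static void domParse(byte[] xml) throws Exception {
        DOM_FACTORY.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
    }

    // Streams the document through a handler that ignores every event.
    static void saxParse(byte[] xml) throws Exception {
        SAX_FACTORY.newSAXParser().parse(new ByteArrayInputStream(xml), new DefaultHandler());
    }

    public static void main(String[] args) {
        byte[] xml = "<screens><screen name=\"main\"><label text=\"hi\"/></screen></screens>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("DOM avg ns: " + averageNanos(ParseTiming::domParse, xml, 100));
        System.out.println("SAX avg ns: " + averageNanos(ParseTiming::saxParse, xml, 100));
    }
}
```

Averaging over many runs smooths out timer granularity and JIT warm-up, which is the reason for the 100 iterations suggested above.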

Anyway, yeah, there are some general thoughts about it at least...

-David


On Apr 25, 2009, at 9:48 PM, Adrian Crum wrote:

>
> OFBiz uses a lot of XML files. When each XML file is read, it is  
> first parsed into a DOM Document, then the DOM Document is parsed  
> into OFBiz Java objects. This two-step process consumes a lot of  
> memory, and it takes more time than it should.
>
> There is an alternative - what is called event-driven parsing. The  
> XML parser can be set up to convert XML elements directly to the  
> OFBiz Java objects - bypassing the DOM Document build and parse  
> steps. Theoretically, this could provide a huge performance boost,  
> and it would use less memory. In addition, it would solve the  
> problem of huge XML files maxing out server memory during the parse  
> process - like with entity XML import/export.
>
> Has anyone else considered this? Do you think it is worth pursuing?
>
> -Adrian
>
>
>
>


Re: Discussion: XML file parsing improvement

Posted by Tim Ruppert <ti...@hotwaxmedia.com>.
The outcome of optimizing the memory usage and helping to get past some of these out of memory issues is definitely a big win - but I don't know enough about the event-driven parsing paradigm to speak to the outcome.  I guess my vote would be that it's worth a shot for sure.

Cheers,
Tim
--
Tim Ruppert
HotWax Media
http://www.hotwaxmedia.com

o:801.649.6594
f:801.649.6595

----- "Adrian Crum" <ad...@yahoo.com> wrote:

> OFBiz uses a lot of XML files. When each XML file is read, it is first
> parsed into a DOM Document, then the DOM Document is parsed into OFBiz
> Java objects. This two-step process consumes a lot of memory, and it
> takes more time than it should.
> 
> There is an alternative - what is called event-driven parsing. The XML
> parser can be set up to convert XML elements directly to the OFBiz
> Java objects - bypassing the DOM Document build and parse steps.
> Theoretically, this could provide a huge performance boost, and it
> would use less memory. In addition, it would solve the problem of huge
> XML files maxing out server memory during the parse process - like
> with entity XML import/export.
> 
> Has anyone else considered this? Do you think it is worth pursuing?
> 
> -Adrian