Posted to c-users@xalan.apache.org by David Rohde <dj...@physics.uq.edu.au> on 2004/10/28 09:43:43 UTC

Controlling Memory Usage

Hi, I am very new to Xalan, Xerces and XML, but I have a problem that I 
need to solve.  I am hoping that somebody may be able to give me a few 
suggestions that might save me from wading through piles of unfamiliar 
documentation and source code.

I would like to adapt a library that I am using, which uses XPath from 
Xalan.  At the moment the library reads very large XML files entirely 
into memory.  The files that I am using contain either a BLOB or a very 
structured XML table as one of the elements.  I would like to modify the 
library in order to:

Continue using XPath / Xalan

I would like to prevent the data from part of the tree being loaded into 
memory.  Is this possible and how is it done?

I would also like to be able to obtain a pointer to the place in the file 
where this tag starts, so I can use a non-DOM method to read this data.

Is this possible?  Are there any suggestions about how I should go about 
doing this?

Thanks for any help.

David Rohde

Re: Controlling Memory Usage

Posted by Don McClimans <dm...@IntiElectronics.com>.

David Rohde wrote:

> Another way to do it might be:
> 
> 1. Read the file and remove the very large tree branch, using fopen,
> fread, etc.
> 
> 2. Write the result to two separate XML files.  One of these files
> represents a header, the other the actual data.
> 
> 3. Continue to use the library on the header, and use an alternative
> implementation on the actual data.
> 
> This would work, but is a little ugly.  Is there a better way?

Perhaps you could read the input XML file using SAX, and create your DOM from the SAX events. Then your SAX handling functions could skip the BLOB or whatever elements you don't want to load. They could convert the blob to a reference: that is, store a DOM element that gives the filename and byte offset in the file. Or they could write the blob to another file, and store a reference to that file in the DOM.

Don
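
The idea above (build the DOM from parser events, skip the large element, and leave a reference behind) can be sketched roughly as follows. This is only an illustration of the technique, written with Python's standard-library expat bindings and minidom rather than Xerces-C, where a SAX2 content handler would play the same role. The function name load_without, the "-ref" placeholder element, and its file/start/close attributes are all invented for this example.

```python
import io
import xml.dom.minidom
import xml.parsers.expat


def load_without(stream, skip_tag, filename):
    """Build a minidom Document from parser events, replacing every
    <skip_tag> subtree with a small placeholder element (hypothetical design)."""
    doc = xml.dom.minidom.Document()
    stack = [doc]      # stack[-1] is the node currently receiving children
    depth = 0          # > 0 while inside a skipped subtree
    stub = None        # placeholder for the subtree being skipped
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        nonlocal depth, stub
        if depth:                       # nested inside a skipped subtree
            depth += 1
            return
        if name == skip_tag:
            stub = doc.createElement(name + "-ref")
            stub.setAttribute("file", filename)
            # Byte offset of the '<' that opens the skipped element.
            stub.setAttribute("start", str(parser.CurrentByteIndex))
            stack[-1].appendChild(stub)
            depth = 1
            return
        elem = doc.createElement(name)
        for key, value in attrs.items():
            elem.setAttribute(key, value)
        stack[-1].appendChild(elem)
        stack.append(elem)

    def end(name):
        nonlocal depth
        if depth:
            depth -= 1
            if depth == 0:
                # Byte offset of the '<' of the matching closing tag.
                stub.setAttribute("close", str(parser.CurrentByteIndex))
            return
        stack.pop()

    def text(data):
        if not depth:                   # text inside the skipped subtree is dropped
            stack[-1].appendChild(doc.createTextNode(data))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = text
    parser.ParseFile(stream)            # reads the input in chunks
    return doc


# Usage with a tiny made-up document; "scan7.xml" is just a label here.
raw = b"<run><title>Scan 7</title><blob>AAAA<x/>BBBB</blob><note>ok</note></run>"
doc = load_without(io.BytesIO(raw), "blob", "scan7.xml")
ref = doc.getElementsByTagName("blob-ref")[0]
blob_start = int(ref.getAttribute("start"))
blob_close = int(ref.getAttribute("close"))
```

The resulting tree holds only the small elements plus one placeholder, so XPath queries over it never touch the blob; a separate reader can seek to the recorded offset in the original file when the data is actually needed.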

Re: Controlling Memory Usage

Posted by David Rohde <dj...@itee.uq.edu.au>.

Thanks for the suggestions, Dave,

I can see that this might be difficult.

> I'm not sure what you're looking for here.  There's a lot the parser does 
> to take an XML document and turn it into a DOM or SAX events, including 
> whitespace normalization, expanding entities, transcoding, etc.  You could 
> write your own implementation of the Xerces-C abstract DOM, or Xalan-C's 
> abstract DOM, but that's a lot of work.  And, in the end, if your XPath 
> expressions refer to the entire tree, you will end up loading the entire 
> tree anyway.
> 


The most important question that I am concerned with at the moment is:

* Is it possible to tell XPath not to descend into part of the tree?

I think you said this was impossible, but I would like to know for sure.

Presumably it is possible if you hack XPath, not that I want to do
that.



You gave me a few suggestions, but those would still result in a
memory-hungry application.  Another way to do it might be:

1. Read the file and remove the very large tree branch, using fopen,
fread, etc.

2. Write the result to two separate XML files.  One of these files
represents a header, the other the actual data.

3. Continue to use the library on the header, and use an alternative
implementation on the actual data.


This would work, but is a little ugly.  Is there a better way?


Thanks again,

David
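
The three-step split described above might look something like the sketch below. It is an illustration only, using Python's standard-library expat bindings to find the byte range of the large branch and plain file I/O to copy it out; the function names (find_branch, copy_range, split) and the empty placeholder element left in the header file are assumptions of this example, not part of any library. It also assumes the branch has an ordinary closing tag and handles only the first occurrence of the tag.

```python
import os
import tempfile
import xml.parsers.expat


def find_branch(path, tag):
    """Return (start, end) byte offsets of the first <tag>...</tag> in the file."""
    parser = xml.parsers.expat.ParserCreate()
    span = {}
    depth = 0

    def start(name, attrs):
        nonlocal depth
        if name == tag and "close" not in span:
            if depth == 0:
                span["start"] = parser.CurrentByteIndex   # offset of '<tag'
            depth += 1

    def end(name):
        nonlocal depth
        if name == tag and "close" not in span:
            depth -= 1
            if depth == 0:
                span["close"] = parser.CurrentByteIndex   # offset of '</tag'

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    with open(path, "rb") as f:
        parser.ParseFile(f)                # streams the file in chunks
        f.seek(span["close"])
        tail = f.read(len(tag) + 64)       # closing tags are short
    return span["start"], span["close"] + tail.index(b">") + 1


def copy_range(src, dst, start, end, chunk=1 << 16):
    """Copy bytes [start, end) from src to dst without loading them all at once."""
    src.seek(start)
    remaining = end - start
    while remaining > 0:
        block = src.read(min(chunk, remaining))
        if not block:
            break
        dst.write(block)
        remaining -= len(block)


def split(path, tag, header_path, data_path):
    """Write the document minus the branch to header_path, the branch to data_path."""
    start, end = find_branch(path, tag)
    size = os.path.getsize(path)
    with open(path, "rb") as src, open(header_path, "wb") as head, \
            open(data_path, "wb") as data:
        copy_range(src, head, 0, start)
        head.write(b"<" + tag.encode() + b"/>")   # placeholder keeps the header well-formed
        copy_range(src, head, end, size)
        copy_range(src, data, start, end)


# Usage with a tiny made-up document in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "input.xml")
    with open(source, "wb") as f:
        f.write(b"<run><title>Scan 7</title><blob>AAAA<x/>BBBB</blob><note>ok</note></run>")
    split(source, "blob", os.path.join(tmp, "header.xml"), os.path.join(tmp, "data.xml"))
    with open(os.path.join(tmp, "header.xml"), "rb") as f:
        header_bytes = f.read()
    with open(os.path.join(tmp, "data.xml"), "rb") as f:
        data_bytes = f.read()
```

Using a parser to locate the offsets, rather than searching the raw bytes for the tag name, avoids false matches inside comments or CDATA sections. Note that the data file is only a standalone document if the branch does not rely on namespace declarations or entities defined elsewhere in the original.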

Re: Controlling Memory Usage

Posted by da...@us.ibm.com.

> I would like to adapt a library that I am using, which uses XPath from 
> Xalan.  At the moment the library reads very large XML files entirely 
> into memory.  The files that I am using contain either a BLOB or a very 
> structured XML table as one of the elements.

Large files will always consume a great deal of memory, even with the 
default source tree implementation, which is about as efficient as it can 
be.

> I would like to prevent the data from part of the tree being loaded into 
> memory.  Is this possible and how is it done?

Unfortunately, because XPath allows random access to the tree, this is a 
very difficult problem to solve.

> I would also like to be able to obtain a pointer to the place in the file 
> where this tag starts, so I can use a non-DOM method to read this data.

I'm not sure what you're looking for here.  There's a lot the parser does 
to take an XML document and turn it into a DOM or SAX events, including 
whitespace normalization, expanding entities, transcoding, etc.  You could 
write your own implementation of the Xerces-C abstract DOM, or Xalan-C's 
abstract DOM, but that's a lot of work.  And, in the end, if your XPath 
expressions refer to the entire tree, you will end up loading the entire 
tree anyway.

> Is this possible?  Are there any suggestions about how I should go about 
> doing this?

Yes, but plan on doing _lots_ of work.  Adding several gigabytes of memory 
to your machine will be far cheaper.

There are some undocumented options with Xalan-C's source tree that will 
pool the strings for the text nodes in a document.  If your documents have 
text nodes with lots of repeated values, this can yield _significant_ 
memory savings.  You might want to build a custom version of the library 
with that option enabled to see if that helps memory consumption.

Dave
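
To make the string-pooling remark concrete, here is a small conceptual sketch of why pooling helps when many text nodes carry the same value. It is not Xalan-C's implementation, and the build option mentioned above is not named in this thread; the StringPool class is purely hypothetical and written in Python for brevity.

```python
class StringPool:
    """Hand out one shared object per distinct string value (illustrative only)."""

    def __init__(self):
        self._pool = {}

    def get(self, value):
        # The first time a value is seen it is stored; afterwards the stored
        # copy is returned, so equal strings end up sharing one object.
        return self._pool.setdefault(value, value)

    def __len__(self):
        return len(self._pool)


pool = StringPool()

# Simulate text nodes parsed from a table whose cells repeat two values.
# join() is used so that each cell starts out as a separately built string.
cells = ["".join(["0.", "0"]) for _ in range(1000)]
cells += ["".join(["1.", "5"]) for _ in range(1000)]

nodes = [pool.get(text) for text in cells]
distinct_objects = len({id(text) for text in nodes})
```

After pooling, the 2000 simulated text nodes refer to just two string objects, one per distinct value. The saving depends entirely on how repetitive the text is; a document whose text nodes are mostly unique gains nothing.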